At Episource, we have been building a scalable NLP engine for information extraction from medical discharge summaries. What makes the problem even harder is the scarcity of quality training data and the breadth of domain expertise this use case demands.
There are many NLP-based solutions in the healthcare industry that claim to be highly accurate and to deliver quick results. However, when such systems are deployed in real production scenarios, they often turn out to be low-precision, low-recall systems that hurt productivity as well as the company's bottom line.
When I was brought on board to head the NLP division at Episource, I focused on four things:
1. In-house training data, created through a peer-reviewed, three-level QA system. Our models have to ingest annotated data to perform well, so it was important to ensure quality. We also had to make sure that the data being annotated is encrypted and that no patient information is accessible to external parties, to ensure HIPAA compliance. We have crossed 20,000 annotated training samples, a treasure trove for our data-hungry algorithms.
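As an illustration of the kind of de-identification step such an annotation pipeline needs, here is a minimal regex-based sketch in Python. The patterns and placeholder scheme are hypothetical, not our actual pipeline, and a real HIPAA Safe Harbor pass would have to cover all 18 identifier categories:

```python
import re

# Hypothetical PHI patterns for illustration only; a production
# de-identification pipeline needs far broader coverage.
PHI_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def deidentify(text: str) -> str:
    """Replace each matched identifier with a typed placeholder."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt seen on 03/14/2017, MRN: 4829103, callback 555-867-5309."
print(deidentify(note))  # Pt seen on [DATE], [MRN], callback [PHONE].
```

Typed placeholders (rather than blanking the text) preserve document structure, which matters when the same charts are later fed to sequence models.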
2. Building architectures for deep learning, with a strong focus on feature engineering and ensemble learning. We actively monitor the latest research, consuming around 30-40 papers a month to distill knowledge into our NLP engine. This helps us develop proprietary solutions that give the best results. Our algorithms are re-tuned and updated on a regular basis, and sometimes completely overhauled when a better algorithm emerges.
The NLP engine deploys complex deep learning techniques, information retrieval algorithms, and graph-based technologies, and incorporates best practices from the latest developments in Machine Learning and Natural Language Processing. Many of our deep learning algorithms take days to train, given the complexity of the task at hand. We also spent a fair bit of time creating taxonomies that distill domain logic and subjective knowledge into a semantic vault, which aids in a higher degree of disambiguation and accuracy.
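One of the simplest ensemble techniques of the kind mentioned above, majority voting over per-token labels, can be sketched in a few lines of Python. The three model outputs and the tag set here are made up for illustration:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label sequences token by token.

    `predictions` is a list of label sequences, one per model, all the
    same length. Ties break toward the first model's label, via
    Counter's insertion-order behavior.
    """
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

# Three hypothetical models tagging the same five tokens
model_outputs = [
    ["O", "B-DIAG", "I-DIAG", "O", "B-MED"],
    ["O", "B-DIAG", "O",      "O", "B-MED"],
    ["O", "B-DIAG", "I-DIAG", "O", "O"],
]
print(majority_vote(model_outputs))
# ['O', 'B-DIAG', 'I-DIAG', 'O', 'B-MED']
```

Production ensembles typically weight models by validation performance rather than voting equally, but the combination step has the same shape.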
3. High-recall, high-precision models: our current systems have a false negative rate of less than 1% and a false positive rate of less than 10% in identifying coding opportunities, which should translate into more revenue for our clients in the long term as well as improve coder productivity. Our ICD code lookups are also based on graph-based technologies and domain taxonomies that help map relationships and dependencies better.
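To make the quoted error rates concrete, here is a small worked example. The chart volume, and the reading of the false positive rate as wrongly flagged opportunities, are assumptions for illustration only:

```python
def precision_recall(tp, fp, fn):
    """Standard definitions from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical volume: 1,000 true coding opportunities. A <1% false
# negative rate means ~10 missed; treating the <10% false positive
# rate as ~90 spurious flags gives:
p, r = precision_recall(tp=990, fp=90, fn=10)
print(f"precision={p:.3f} recall={r:.3f}")
# precision=0.917 recall=0.990
```

The asymmetry is deliberate: in coding-opportunity detection a miss is lost revenue, while a false flag only costs a coder a quick review, so recall is weighted more heavily.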
4. Building production-grade code and scalable systems to deploy these models in a reproducible and encrypted fashion. Our technical backends are lean and fast: we can process roughly 250 charts (each ~50 pages long) per instance per hour, at a cost of a few cents per chart. Compare that to a human coder, who can process no more than 3 charts per hour.
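The throughput claim above is easy to sanity-check with back-of-envelope arithmetic:

```python
# Back-of-envelope comparison using the figures quoted above.
machine_charts_per_hour = 250  # per instance, ~50-page charts
human_charts_per_hour = 3      # quoted upper bound for a human coder

speedup = machine_charts_per_hour / human_charts_per_hour
print(f"~{speedup:.0f}x charts per hour per instance")  # ~83x
```

And because instances scale horizontally, the gap widens linearly with each instance added, while per-chart cost stays flat.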
The talk will dig deeper into the four motivations above and explain some of the constraints that go into building a deep learning based clinical decision support system while staying on the right side of legal and business guidelines. We will also share our learnings from building annotation pipelines for training data creation and deep learning frameworks, specifically from the point of view of clinical Named Entity Recognition systems.
I currently lead the NLP & Data Science practice at Episource, a US healthcare company. My daily work revolves around semantic technologies and computational linguistics (NLP), building algorithms and machine learning models, reading data science journals, and architecting secure product backends in the cloud.
The tech stack that my team and I typically work with includes:
Testing Frameworks: unittest, pytest
Automation & Configuration Management: Ansible, Docker, Vagrant
CI: Travis CI
Cloud Services: AWS, Google Cloud, MS Azure
APIs: Bottle, CherryPy, Flask
Databases: MySQL, SQLite, MSSQL, RDF stores, Neo4j, Elasticsearch, MongoDB, Redis
Editors: Sublime Text, PyCharm
I have architected multiple commercial NLP solutions in healthcare, food & beverage, finance, and retail. I am deeply involved in architecting large-scale business process automation and extracting deep insights from structured and unstructured data using Natural Language Processing and Machine Learning. I have contributed to multiple NLP libraries, including Gensim and ConceptNet5. I blog regularly on NLP on forums such as Data Science Central, LinkedIn, and my own blog, Unlock Text.
I love teaching and mentoring students. I speak regularly on NLP and text analytics at conferences and meetups such as PyCon India and PyData. I have also taught multiple hands-on sessions at IIM Lucknow and MDI Gurgaon, and mentored students from schools such as ISB Hyderabad, BITS Pilani, and the Madras School of Economics. When bored, I like to fall back on Asimov to lead me into an alternate reality.
©2018, O’Reilly UK Ltd