Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Building a healthcare decision support system for ICD10/HCC coding through deep learning

Manas R Kar (Episource)
14:05–14:45 Wednesday, 23 May 2018
Average rating: 3.00 (3 ratings)

Who is this presentation for?

  • CEOs, COOs, VPs, AVPs, directors, managers, developers, solution architects, data engineers, and healthcare professionals

Prerequisite knowledge

  • A basic understanding of deep learning, NLP, and cloud platforms
  • Familiarity with healthcare terminology (e.g., ICD10 and HIPAA)

What you'll learn

  • Explore Episource's clinical NER systems and discover why such systems are an important cog in the wheel for current healthcare firms
  • Understand models and techniques that are typically deployed for healthcare NLP, constraints in building a deep learning architecture for healthcare data, and considerations in architecting a scalable backend for processing large quantities of data in an encrypted fashion


Many NLP-based solutions in the healthcare industry claim to be very accurate and to deliver quick results. However, when such systems are deployed in real production scenarios, they often turn out to have low precision and low recall, which affects productivity and hurts the company’s bottom line. The task is difficult because quality training data is scarce and success requires broad domain expertise.

Episource is building a scalable NLP engine to help summarize medical charts and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD10 codes. Manas Ranjan Kar offers an overview of the wide variety of deep learning algorithms behind Episource’s solution and the complex in-house training-data creation exercises that were required to make it work, focusing on four key motivations for the system. Manas also explains some of the constraints involved in building a deep learning-based clinical decision support system while remaining on the right side of legal and business guidelines. He shares lessons learned from building annotation pipelines for training data creation and deep learning frameworks, specifically from the point of view of clinical named entity recognition systems.

Topics include:

  • Creating in-house training data in a peer-reviewed three-level QA system: Episource’s models have to ingest annotated data to perform well, so ensuring annotation quality was important. To ensure HIPAA compliance, the company also has to make sure that the data being annotated is encrypted and that no patient information is accessible to external parties. Episource has created more than 20K annotated training data samples, a treasure trove for its data-hungry algorithms.
  • Building architectures for deep learning, with more focus on feature engineering and ensemble learning: Episource has an active interest in monitoring the latest research and consumes between 30 and 40 research papers a month to distill knowledge into its NLP engine. This helps the company develop solutions that are proprietary and give the best results. Algorithms are retuned and updated on a regular basis and are sometimes replaced outright by a better algorithm. Likewise, the NLP engine deploys complex deep learning techniques, information retrieval algorithms, and graph-based technologies and incorporates best practices from the latest developments in the field of machine learning and natural language processing. Many of the company’s deep learning algorithms take days to train, given the complexity of the task at hand. Episource also spends a fair bit of time creating taxonomies to distill domain logic and subjective knowledge into a semantic vault that aids disambiguation and accuracy.
  • High-recall, high-precision models: Episource’s current systems have a false negative rate of less than 1% and a false positive rate of less than 10% in identifying coding opportunities, which should translate to more revenue for its clients in the long term and improve coder productivity. ICD code lookups are based on graph-based technologies and domain taxonomies that help map relationships and dependencies better.
  • Building production-grade code and scalable systems to deploy these models in a reproducible and encrypted fashion: Episource’s technical architectural backends are lean and fast. The company can process roughly 250 charts (each about 50 pages long) per instance per hour, at a cost of a few cents per chart (compared to a human, who can process no more than three charts per hour).
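
The figures quoted in the last two topics can be related with some simple arithmetic. The sketch below is illustrative only and is not Episource's code. It assumes the standard definitions (false negative rate = FN / (FN + TP), false positive rate = FP / (FP + TN)), which the abstract does not spell out, and it uses hypothetical prevalence values, since the share of text spans that are real coding opportunities is not stated. Recall follows directly from the false negative rate, but precision also depends on that prevalence.

```python
"""Back-of-the-envelope arithmetic for the figures quoted in the abstract.

Assumptions (not from the source): standard FNR/FPR definitions, and
hypothetical prevalence values for real coding opportunities.
"""


def recall_from_fnr(fnr: float) -> float:
    """Recall (sensitivity) is the complement of the false negative rate."""
    return 1.0 - fnr


def precision_from_rates(fnr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by FNR and FPR at a given prevalence of positives.

    prevalence is the fraction of candidates that are true coding
    opportunities; it must be > 0 for the result to be defined.
    """
    true_pos = (1.0 - fnr) * prevalence   # share of all candidates that are TP
    false_pos = fpr * (1.0 - prevalence)  # share of all candidates that are FP
    return true_pos / (true_pos + false_pos)


def throughput_ratio(machine_per_hour: float, human_per_hour: float) -> float:
    """How many times more charts one instance handles than one coder."""
    return machine_per_hour / human_per_hour


if __name__ == "__main__":
    FNR, FPR = 0.01, 0.10  # upper bounds quoted in the abstract
    print(f"recall at FNR={FNR}: {recall_from_fnr(FNR):.2%}")

    for prevalence in (0.5, 0.1):  # hypothetical values, for illustration
        p = precision_from_rates(FNR, FPR, prevalence)
        print(f"precision at prevalence={prevalence}: {p:.2%}")

    # 250 charts/hour per instance versus at most 3 charts/hour per coder.
    print(f"throughput ratio: {throughput_ratio(250, 3):.1f}x")
    print(f"pages per instance per hour: {250 * 50}")
```

Under these assumptions, a false negative rate below 1% means recall above 99%. The implied precision is roughly 91% when half of the candidates are real opportunities and roughly 52% when only one in ten is, so the "high-precision" claim is best read alongside the actual prevalence in the charts being coded.
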

Manas R Kar


Manas Ranjan Kar is an associate vice president at US healthcare company Episource, where he leads the NLP and data science practice, works on semantic technologies and computational linguistics (NLP), builds algorithms and machine learning models, researches data science journals, and architects secure product backends in the cloud. He’s architected multiple commercial NLP solutions in the areas of healthcare, food and beverages, finance, and retail. Manas is deeply involved in functionally architecting large-scale business process automation and deriving deep insights from structured and unstructured data using NLP and ML. He’s contributed to NLP libraries including gensim and Conceptnet 5 and blogs regularly about NLP on forums like Data Science Central, LinkedIn, and his blog Unlock Text. Manas speaks regularly about NLP and text analytics at conferences and meetups, such as PyCon India and PyData, has taught hands-on sessions at IIM Lucknow and MDI Gurgaon, and has mentored students from schools including ISB Hyderabad, BITS Pilani, and the Madras School of Economics. When bored, he falls back on Asimov to lead him into an alternate reality.