Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Digging for gold: Developing AI in healthcare against unstructured text data

Chiny Driscoll (MetiStream), Jawad Khan (Rush University Medical Center)
2:00pm–2:40pm Thursday, 09/13/2018
Strata Business Summit
Location: 1E 12/13 Level: Non-technical
Secondary topics:  Health and Medicine, Text and Language processing and analysis
Average rating: 4.00 (5 ratings)

Who is this presentation for?

  • Anyone working in business or technology, particularly those within the healthcare industry

What you'll learn

  • Learn approaches for processing unstructured data
  • Understand the value of NLP in advanced analytics, how to apply ML models to various datasets, and best practices for operationalizing ML models and ML processes


Healthcare faces significant growth in both structured and unstructured data, with challenges including diverse data formats, interoperability, regulatory requirements, and the need for advanced analytics. Today the industry is a $3 trillion market that uses AI, big data, and data of all formats and varieties to drive innovative solutions that can improve patient care, efficiency, and financial outcomes.

Over the years, electronic health records (EHRs) have become more sophisticated and feature rich, but despite these advances, doctors still enjoy the simplicity of summarizing their patient feedback in freeform clinical notes. Unfortunately, many of these notes are not coded properly after the fact, leaving critical information forgotten and potentially locked away in the patient record. This includes valuable information such as medications, symptoms, and diagnoses, all of which could be used to inform future care. The unstructured nature of clinical notes makes it difficult to process this data, join it with other high-value healthcare datasets such as pathology, genomics, or billing records, and produce vital analytics such as key performance indicators (KPIs) or risk predictions. All of this can result in poor patient outcomes and lost opportunities for revenue or operational efficiencies.

Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses. The solution leverages key open source projects such as cTAKES, a natural language processing (NLP) suite developed for extracting information from clinical free text in EHR systems. cTAKES implements its NLP pipeline on the Unstructured Information Management Architecture (UIMA) framework; the pipeline annotates free text to discern clinical terms and then normalizes these terms to well-known ontology codes (notably UMLS CUI, SNOMED CT, and RxNorm).
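
The annotate-then-normalize idea can be illustrated with a minimal Python sketch. This is not cTAKES or UIMA code; it is a toy dictionary lookup that only shows the shape of the output (a text span plus a normalized code). The ONTOLOGY table, the Annotation class, and the annotate function are hypothetical names, and the codes are placeholders rather than real UMLS, SNOMED CT, or RxNorm identifiers.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical term-to-code dictionary. The codes below are placeholders,
# NOT real UMLS, SNOMED CT, or RxNorm identifiers.
ONTOLOGY = {
    "hypertension": ("diagnosis", "CODE-0001"),
    "high blood pressure": ("diagnosis", "CODE-0001"),  # synonym of the same concept
    "metformin": ("medication", "CODE-0002"),
    "chest pain": ("symptom", "CODE-0003"),
}


@dataclass(frozen=True)
class Annotation:
    start: int      # character offset where the term begins in the note
    end: int        # character offset just past the term
    text: str       # the term exactly as written in the note
    category: str   # medication, symptom, or diagnosis
    code: str       # normalized (placeholder) ontology code


def annotate(note: str) -> List[Annotation]:
    """Find known terms in free text and attach their normalized codes."""
    found = []
    for term, (category, code) in ONTOLOGY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, note, flags=re.IGNORECASE):
            found.append(
                Annotation(match.start(), match.end(), match.group(0), category, code)
            )
    # Return annotations in the order they appear in the note.
    return sorted(found, key=lambda a: a.start)


note = "Pt reports chest pain. History of high blood pressure; continue Metformin."
annotations = annotate(note)
for a in annotations:
    print(a.category, repr(a.text), a.code)
```

Note that "high blood pressure" and "hypertension" map to the same placeholder code, which is the point of normalization: different wordings of one concept become one value that can be joined with other datasets. A real clinical NLP pipeline also has to handle negation, abbreviations, and context, which this sketch ignores.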

When NLP is used in conjunction with Apache Spark, years of unstructured text can be processed within hours. Integration with Spark also lets the processed notes take advantage of the machine learning capabilities available through Spark MLlib and ML Pipelines, so that annotated clinical data can be used to train models and develop risk predictions. The predictive models and the insights they produce can then be distributed and operationalized throughout the enterprise.
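
As a rough illustration of how annotated notes become model input, the sketch below turns the concept codes found in each note into count vectors. It is plain Python rather than Spark code, and it does not reflect the presenters' implementation; in Spark, a feature transformer such as CountVectorizer from pyspark.ml.feature plays a similar role inside an ML Pipeline. The note IDs, the placeholder codes, and the helper names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical input: the normalized concept codes found in each note.
# Note IDs and codes are placeholders, not real data or real ontology codes.
note_codes: Dict[str, List[str]] = {
    "note-1": ["CODE-0003", "CODE-0001"],
    "note-2": ["CODE-0002"],
    "note-3": ["CODE-0001", "CODE-0002", "CODE-0001"],
}


def build_vocabulary(codes_by_note: Dict[str, List[str]]) -> List[str]:
    """Collect every distinct code and fix a stable (sorted) column order."""
    return sorted({code for codes in codes_by_note.values() for code in codes})


def to_count_vector(codes: List[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary code occurs in one note."""
    index = {code: i for i, code in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for code in codes:
        vector[index[code]] += 1
    return vector


vocabulary = build_vocabulary(note_codes)
features = {
    note_id: to_count_vector(codes, vocabulary)
    for note_id, codes in note_codes.items()
}

print(vocabulary)
for note_id, vector in features.items():
    print(note_id, vector)
```

Each note is now a fixed-length numeric row, which is the form a classifier needs for training a risk model once outcome labels are attached.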

Topics include:

  • How to automatically structure, annotate, and index clinical notes at scale
  • How to process data in real time or batch, including multiyear historical loads
  • How to conduct free-form text search with millisecond response times
  • How to build machine learning models against the output of the structured text
  • How to build and deploy advanced analytics within clinical workflows

Chiny Driscoll


Chiny Driscoll is founder and CEO at MetiStream, a provider of real-time integration and analytic services in the big data arena. Chiny has more than 24 years of management and executive leadership experience in the technology industry and has served in a variety of roles with Fortune 500 tech companies. Previously, Chiny was the worldwide executive leader of big data services for IBM’s Information Management Division, where she led all of the professional services that implemented and supported IBM’s big data products and solutions (including streaming, analytics, Hadoop, and data warehouse appliances) across industries such as financial services, communications, the public sector, and retail. She was also vice president and general manager of Netezza, a leader in big data warehouse appliances and advanced analytics (acquired by IBM in 2010); held various global and regional leadership roles at TIBCO Software, where her responsibilities included running the presales, services, and sales operations for the Public Sector Division; and served in services leadership roles at EDS and other services and technology companies.

Jawad Khan

Rush University Medical Center

Jawad Khan is director of data sciences and knowledge management at Rush University Medical Center, where he leads Rush’s analytics and data strategy, focusing on leveraging data from all sections of the business, including clinical, ERP, security, device sensors, and people/patient-generated data, to provide improved safety, better clinical outcomes, reduced cost, and innovation. Jawad has more than 20 years of experience in analytics, software development, data management, and data security. Previously, he was a lead architect at CenturyLink, where he provided cloud enablement strategies for data and applications to clients like GE Capital, Coca-Cola, Procter & Gamble, and Warner Bros., and a managing director at Opus Capital Markets, where he was responsible for leading analytics, data security and compliance, and software development as well as data center and infrastructure development and operations. He also worked as a software engineer consultant for one of the Big Six consulting firms. Jawad holds a degree in computer engineering from Southern Illinois University. He speaks regularly at professional and community events and is a cricket commentator for Chicago NPR affiliate WBEZ.