Sep 23–26, 2019

Feature engineering with Spark NLP to accelerate clinical trial recruitment

Saif Addin Ellafi (John Snow Labs), Scott Hoch (
1:15pm–1:55pm Wednesday, September 25, 2019
Location: 3B - Expo Hall
Secondary topics: Health and Medicine, Text and Language processing and analysis

Who is this presentation for?

Data scientists, machine learning engineers, and engineering leaders.




Recruiting patients for clinical trials is a major challenge in drug development. Finding patients requires an in-depth understanding of their medical histories and current health status, yet the majority of patient data is unstructured and spread across physician notes and pathology, imaging, genomic, and other reports. As a result, clinical trial recruitment is a slow, manual process. This case study describes how Deep6 uses the Spark NLP platform to apply state-of-the-art deep learning to accurately extract the relevant clinical facts from unstructured text. These facts are then used in downstream data science pipelines to construct patients’ medical histories.

John Snow Labs’ NLP library for Apache Spark is open source and provides natural language understanding capabilities with state-of-the-art accuracy, performance, and scale. It offers deep-learning-based NLP algorithms for named entity recognition, spell checking, sentiment analysis, assertion status detection, entity resolution, OCR, and sentence segmentation, and enables highly efficient training of domain-specific machine learning and deep learning NLP models.

We will explain how Deep6 uses Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. We will cover the technical challenges, the architecture of the full solution, and lessons learned that you can apply directly to your next natural language understanding project.

Prerequisite knowledge

Basic familiarity with NLP, Spark, and machine learning is assumed.

What you'll learn

Lessons learned and recommendations for achieving state-of-the-art NLP accuracy, performance, and scale in a real-life application. Case study for Spark NLP for Healthcare.

Saif Addin Ellafi

John Snow Labs

Saif Addin Ellafi is a software developer at John Snow Labs, where he is the main contributor to Spark NLP. A data scientist, forever student, and extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Scott Hoch

Scott Hoch is a lead data scientist at, where he works on matching patients with clinical trials in minutes instead of months. Previously, he was VP of Engineering at Duco, a solutions engineer at Gem HQ, and a data engineer at NationBuilder. He holds a master's degree in physics from Yale University.
