Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Spark NLP in action: How SelectData uses AI to better understand home health patients

David Talby (Pacific AI), Alberto Andreotti (John Snow Labs), Stacy Ashworth (SelectData), Tawny Nichols (SelectData)
1:10pm–1:50pm Thursday, 09/13/2018
Data science and machine learning
Location: 1A 06/07
Level: Intermediate
Secondary topics: Health and medicine, Text and language processing and analysis

Who is this presentation for?

  • Data scientists, NLP engineers, and software architects

Prerequisite knowledge

  • Familiarity with machine learning and Spark

What you'll learn

  • Understand how natural language understanding can be applied to patient records and how deep learning can be applied to NLP
  • Explore Spark NLP, an open source NLP library for Apache Spark


Accurately answering clinical and billing questions by reading patient records, which can be a hundred or more pages long, is a challenge even for human domain experts. While traditional rule-based or expression-matching techniques work for simple fields in templated documents, it’s harder to infer facts based on implied statements, the absence of certain statements, or a combination of other facts. Answering such questions at a very high level of accuracy requires state-of-the-art deep learning techniques applied to NLP.
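The gap between templated fields and implied facts can be sketched in a few lines of Python. This is an illustration only, not the method used in the talk: the field name `Primary Diagnosis`, the helper `extract_primary_diagnosis`, and both sample records are made up for this example.

```python
import re
from typing import Optional

# Hypothetical templated field of the form "Primary Diagnosis: <text>".
FIELD_PATTERN = re.compile(r"^Primary Diagnosis:\s*(?P<value>.+)$", re.MULTILINE)


def extract_primary_diagnosis(record: str) -> Optional[str]:
    """Return the templated field value, or None if the field is not present."""
    match = FIELD_PATTERN.search(record)
    return match.group("value").strip() if match else None


# Invented sample records, for illustration only.
templated = "Patient: Jane Doe\nPrimary Diagnosis: Type 2 diabetes\nVisit: 3"
narrative = "Patient checks blood glucose twice daily and takes metformin with meals."

templated_result = extract_primary_diagnosis(templated)  # field is found
narrative_result = extract_primary_diagnosis(narrative)  # nothing to match
```

The pattern recovers the value from the templated record but returns nothing for the narrative note, even though a human reader would likely infer a diabetes diagnosis from it. Closing that gap is the kind of inference the abstract argues needs learned models rather than expression matching.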

Spark NLP, John Snow Labs’ NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language understanding capabilities at a level of performance and scale that was previously impossible. It provides advanced NLP algorithms such as named entity recognition, fact extraction, spell checking, sentiment analysis, assertion status detection, and entity resolution, and it enables highly efficient training of domain-specific machine learning and deep learning NLP models, a prerequisite for high-accuracy question answering.
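To make "assertion status detection" concrete, here is a toy, rule-based sketch in plain Python. It is not Spark NLP's API or algorithm. The cue list, the function name `assertion_status`, and the label names are assumptions chosen for illustration. It simply checks whether a negation cue appears before the entity in a sentence.

```python
import re

# Assumed, deliberately tiny list of negation cues (illustrative, not clinical-grade).
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b")


def assertion_status(sentence: str, entity: str) -> str:
    """Label an entity mention as 'present', 'absent', or 'not_mentioned'."""
    text = sentence.lower()
    position = text.find(entity.lower())
    if position == -1:
        return "not_mentioned"
    # Only the text before the entity is searched for a negation cue.
    preceding = text[:position]
    return "absent" if NEGATION_CUES.search(preceding) else "present"


denied = assertion_status("Patient denies chest pain.", "chest pain")
reported = assertion_status("Patient reports chest pain on exertion.", "chest pain")
missing = assertion_status("Wound is healing well.", "chest pain")
```

A rule this simple breaks quickly: a sentence such as "No improvement in chest pain" would be labeled `absent` even though the pain persists. Failures like that are one reason to train domain-specific models instead of relying on cue lists.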

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols explain how Spark NLP augments the SelectData Data Science Platform to extract fuzzy, implied, and complex facts from home health patient records, covering the technical challenges, the architecture of the full solution, and lessons learned that you can directly apply to your next natural language understanding project.

Photo of David Talby

David Talby

Pacific AI

David Talby is chief technology officer at Pacific AI, where he helps fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe as part of Microsoft’s Bing Group, and he built and ran distributed teams in both Seattle and the UK that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Photo of Alberto Andreotti

Alberto Andreotti

John Snow Labs

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies including Motorola, Intel, and Samsung and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.

Photo of Stacy Ashworth

Stacy Ashworth

SelectData


Stacy Ashworth is a registered nurse and chief clinical officer at SelectData. Stacy’s professional interests lie in the use of technology to improve the quality of care through better decision making. An accomplished speaker, she contributed to the healthcare informatics and technology track of the 2016 Business and Health Administration Association meeting with research on evaluating glucose monitoring technologies for cost-effective, quality control and management of diabetes. She holds a master’s degree in healthcare administration with an emphasis in informatics. Postacute care, geriatrics, and coding may be her passions, but her love is firmly centered on her family of two lively teenagers, a spouse, and a couple of schnauzers to keep things interesting.

Photo of Tawny Nichols

Tawny Nichols

SelectData

Tawny Nichols is chief information officer at SelectData, where she is responsible for new product development, clinical tools, and all technology-related needs. She also leads SelectData’s innovation of data-driven business models. Tawny has over 15 years’ experience supporting the homecare industry. She is currently pursuing an MS in healthcare informatics at the University of San Diego.