Sep 9–12, 2019

Interpreting millions of patient stories with deep learned OCR and NLP

Stacy Ashworth (SelectData), Alberto Andreotti (John Snow Labs)
4:00pm–4:40pm Wednesday, September 11, 2019
Location: LL21 C/D

Who is this presentation for?

  • Data scientists, data engineers, and NLP and OCR specialists




Many businesses still depend on documents stored as images, from receipts, manifests, invoices, medical reports, and ID cards snapped with mobile phone cameras to contracts, waivers, leases, forms, and audit records digitized with scanners. Extracting high-quality data from these images poses three challenges. The first is OCR: reading text reliably from inputs such as a crumpled receipt photographed at an angle in a dimly lit room. The second is NLP: extracting normalized values and entities from the natural language text. The third is building predictors or recommenders that suggest the next best action and, in particular, can cope with missing, wrong, or conflicting information generated by the previous steps.
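As a rough illustration of the third challenge, the sketch below reconciles conflicting values extracted for the same field before they reach a predictor. It is a minimal example under stated assumptions, not the system described in the session: the `Extraction` record, the `reconcile` function, the confidence-summing rule, and the `min_confidence` threshold are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Extraction(NamedTuple):
    """One value produced by an upstream OCR/NLP step (hypothetical record)."""
    field: str               # e.g. "dob"
    value: Optional[str]     # None when the extractor found nothing
    confidence: float        # extractor's own score in [0, 1]


def reconcile(extractions, min_confidence=0.5):
    """Pick one value per field by summing confidence over agreeing extractions.

    Returns {field: value or None}. None marks a field that is missing
    or whose best candidate scores below min_confidence.
    """
    scores = defaultdict(lambda: defaultdict(float))
    fields = set()
    for ex in extractions:
        fields.add(ex.field)
        if ex.value is not None:
            # Normalize lightly so trivially different spellings agree.
            scores[ex.field][ex.value.strip().lower()] += ex.confidence

    resolved = {}
    for field in fields:
        candidates = scores.get(field, {})
        if not candidates:
            resolved[field] = None  # missing everywhere
            continue
        best_value, best_score = max(candidates.items(), key=lambda kv: kv[1])
        resolved[field] = best_value if best_score >= min_confidence else None
    return resolved


if __name__ == "__main__":
    pages = [
        Extraction("dob", "1950-01-02", 0.6),
        Extraction("dob", "1950-01-07", 0.3),   # conflicting OCR read
        Extraction("dob", "1950-01-02", 0.2),
        Extraction("mrn", None, 0.0),           # missing value
    ]
    print(reconcile(pages))
```

Returning `None` for uncertain fields, rather than guessing, lets a downstream model treat the value as explicitly missing.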

The good news is that state-of-the-art deep learning techniques, now available as open source software, can approach human accuracy on all three tasks and do so at scale. Stacy Ashworth and Alberto Andreotti explore a case study of an AI system that reads millions of pages of patient information gathered from hundreds of sources, and therefore varying widely in image format, template, and quality. They explore the solution architecture and key lessons learned in going from raw images to a deployed predictive workflow based on facts extracted from the scanned documents.

You’ll be introduced to Spark OCR and Spark NLP, two open source (Apache licensed), natively distributable, deep learning-based libraries. The OCR library employs adaptive scaling, rotation, and erosion to achieve a significant accuracy boost compared to Tesseract. Spark NLP applies techniques such as BERT embeddings, trainable pipelines, and DL-based sentence segmentation and spell checking that materially improve accuracy for OCR-sourced text mining. Since both libraries are native extensions of Apache Spark, a unified pipeline can be written in Python, Java, or Scala for all three stages (including ML based on the results of OCR and NLP), enabling a new level of scale, speed, and reproducibility for the entire pipeline from image to next-best action. Notebooks with example code will be made public afterward.
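To make one of the preprocessing steps named above concrete, here is a minimal pure-Python sketch of binary erosion, which removes isolated specks and thins strokes before recognition. It is a generic 3x3 morphological erosion written for illustration only. It does not use or reproduce the Spark OCR API, and the function name `erode` and the convention of ink pixels being 1 are assumptions.

```python
def erode(image):
    """3x3 binary erosion on a list-of-lists image (1 = ink, 0 = background).

    A pixel stays 1 only if it and all 8 neighbours are 1.
    Pixels outside the image are treated as 0.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            keep = 1
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if not inside or image[ny][nx] == 0:
                        keep = 0
            out[y][x] = keep
    return out


if __name__ == "__main__":
    scan = [
        [0, 0, 0, 0, 1],   # the lone 1 is a noise speck
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in erode(scan):
        print(row)
```

In a real pipeline this kind of operation would normally come from an image-processing library and be tuned per document. The sketch only shows what the operation does to the pixels.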

Prerequisite knowledge

  • Familiarity with Spark, machine learning, and deep learning

What you'll learn

  • Learn how to deliver state-of-the-art OCR and NLP at scale, as part of a complete decision-support workflow built with open source tools

Stacy Ashworth


Stacy Ashworth is a registered nurse and chief clinical officer at SelectData. Stacy’s professional interests lie in the use of technology to improve the quality of care through better decision making. An accomplished speaker, she has served as a contributor to the healthcare informatics and technology track of the 2016 Business and Health Administration Association meeting, performing research on the evaluation of glucose monitoring technologies for cost-effective, high-quality control and management of diabetes. She holds a master’s degree in healthcare administration with an emphasis in informatics. Postacute care, geriatrics, and coding may be her passions, but her love is firmly centered on her family of two lively teenagers, a spouse, and a couple of schnauzers to keep things interesting.


Alberto Andreotti

John Snow Labs

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies including Motorola, Intel, and Samsung and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.

