Sep 9–12, 2019

Interpreting millions of patient stories with deep-learned OCR & NLP

Stacy Ashworth (SelectData), Alberto Andreotti (John Snow Labs)
4:00pm–4:40pm Wednesday, September 11, 2019
Location: 231

Who is this presentation for?

Data scientists, Data engineers, NLP & OCR specialists

Level

Intermediate

Description

Many businesses still depend on documents stored as images – from receipts, manifests, invoices, medical reports, and ID cards snapped with mobile phone cameras, to contracts, waivers, leases, forms and audit records digitized with scanners. Extracting high-quality data from these images brings three challenges. The first is OCR, as in dealing with crumpled receipts photographed from an angle in a dimly lit room. The second is NLP: extracting normalized values and entities from the natural language text. The third is building predictors or recommenders that suggest the best next action and, in particular, can cope with missing, wrong, or conflicting information generated by the previous steps.
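The third challenge, coping with missing or conflicting values coming out of the OCR and NLP steps, can be illustrated with a small sketch. This is a hypothetical illustration in plain Python, not the system described in the talk: the `Extraction` record, the `reconcile` function, the confidence-summing rule, and the 0.6 threshold are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    """One candidate value for a field from an upstream step (hypothetical record)."""
    field: str
    value: Optional[str]   # None means the step found nothing for this field
    confidence: float      # assumed to lie in [0, 1]


def reconcile(extractions, min_confidence=0.6):
    """Pick one value per field, or None when evidence is missing or too weak.

    Candidates that agree on a value pool their confidence by summing it.
    The value with the highest total wins, provided its best single
    confidence reaches min_confidence (an assumed threshold).
    """
    totals = defaultdict(float)
    best = defaultdict(float)
    fields = set()
    for e in extractions:
        fields.add(e.field)
        if e.value is None:
            continue
        key = (e.field, e.value)
        totals[key] += e.confidence
        best[key] = max(best[key], e.confidence)

    result = {}
    for field in fields:
        candidates = [(total, value) for (f, value), total in totals.items() if f == field]
        if not candidates:
            result[field] = None   # missing: no step produced a value
            continue
        _, value = max(candidates)
        result[field] = value if best[(field, value)] >= min_confidence else None
    return result


# Made-up example records: two sources agree on a date, one disagrees,
# one field is missing, and one field has only weak evidence.
records = [
    Extraction("dob", "1950-03-02", 0.9),
    Extraction("dob", "1950-08-02", 0.4),
    Extraction("dob", "1950-03-02", 0.5),
    Extraction("medication", None, 0.0),
    Extraction("diagnosis", "diabetes", 0.3),
]
resolved = reconcile(records)
```

Returning `None` rather than a low-confidence guess keeps the uncertainty visible to the downstream predictor, which can then treat the field as missing instead of trusting a wrong value.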

The good news is that state-of-the-art deep learning techniques, now available as open source software, can approach human accuracy in these three tasks – and do so at scale. This talk is a case study of an AI system that reads millions of pages of patient information, gathered from hundreds of sources – resulting in a great variety of image formats, templates, and quality. It describes the solution architecture and key lessons learned in going from raw images to a deployed predictive workflow based on facts extracted from the scanned documents.

This session introduces Spark OCR and Spark NLP: two open-source (Apache licensed), natively distributable, deep-learning based libraries. The OCR library employs adaptive scaling, rotation, and erosion to achieve a significant accuracy boost compared to Tesseract. Spark NLP applies techniques such as BERT embeddings, trainable pipelines, and DL-based sentence segmentation and spell checking that materially improve accuracy for OCR-sourced text mining. Finally, since both libraries are native extensions of Apache Spark, a unified pipeline can be written in Python, Java or Scala for all three stages (including ML based on the results of OCR and NLP), enabling a new level of scale, speed, and reproducibility for the entire pipeline from image to next best action. Notebooks with example code will be made public after the talk.
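For readers unfamiliar with erosion as an image preprocessing step, the following is a minimal sketch of standard binary morphological erosion in plain Python. It does not show Spark OCR's API or implementation; the `erode` function, the 3x3 structuring element, and the toy image are assumptions chosen for illustration. With ink treated as foreground, erosion removes isolated specks and thins strokes.

```python
def erode(image, iterations=1):
    """Binary erosion with a 3x3 square structuring element.

    image is a list of equal-length rows of 0/1 ints, where 1 marks ink.
    A pixel stays 1 only if it and all 8 neighbours are 1; pixels outside
    the image count as background, so border pixels always become 0.
    The input is not modified; a new grid is returned.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for _ in range(iterations):
        eroded = [[0] * width for _ in range(height)]
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                if all(image[y + dy][x + dx]
                       for dy in (-1, 0, 1)
                       for dx in (-1, 0, 1)):
                    eroded[y][x] = 1
        image = eroded
    return image


# Toy image: a one-pixel speck at the top-left plus a 3x3 block of ink.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
cleaned = erode(noisy)
```

After one pass the speck is gone and only the centre pixel of the block survives, which shows the trade-off: erosion removes noise but also thins real strokes, so the number of iterations has to be matched to the scan.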

Prerequisite knowledge

Basic familiarity with Spark, machine learning and deep learning is assumed.

What you'll learn

Learn how to deliver state-of-the-art OCR & NLP as part of a complete decision support workflow, using open source tools and at scale.

Stacy Ashworth

SelectData

Stacy Ashworth is a registered nurse and chief clinical officer at SelectData. Stacy’s professional interests lie in the use of technology to improve the quality of care through better decision making. An accomplished speaker, she has served as a contributor to the Healthcare Informatics and Technology track of the 2016 Business and Health Administration Association meeting, performing research regarding the evaluation of glucose monitoring technologies for cost-effective and quality control/management of diabetes. She holds a master’s degree in healthcare administration with an emphasis in informatics. Postacute care, geriatrics, and coding may be her passions, but her love is firmly centered on her family of two lively teenagers, a spouse, and a couple of schnauzers to keep things interesting.


Alberto Andreotti

John Snow Labs

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he is implementing state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies including Motorola, Intel, and Samsung and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.
