Sep 23–26, 2019

Natural language understanding at scale with Spark NLP

David Talby (Pacific AI), Alex Thomas (Indeed), Saif Addin Ellafi (John Snow Labs)
1:30pm–5:00pm Tuesday, September 24, 2019
Location: 1A 23/24
Secondary topics: Deep dive into specific tools, platforms, or frameworks; Text and language processing and analysis

Who is this presentation for?

Practicing data scientists

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

This is a hands-on tutorial on state-of-the-art NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.


Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment, and perform entity recognition:

When is an NLP library needed?
Introduction to Spark NLP
Benchmarks and scalability
Built-in Spark NLP annotators
Core NLP tasks: Tokenizer, normalizer, stemmer, lemmatizer, chunker, POS, and NER
Using pretrained models and pipelines

Building a machine learning pipeline that includes and depends on NLP annotators to generate features:

Feature engineering and optimization
Trainable NLP tasks: Spell checker, sentiment analysis, named entity recognition
Applying word embeddings to “featurize” text
Best practices and common pitfalls for creating unified NLP and ML pipelines

Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP:

Why you’ll need to train domain-specific NLP models for most real-world use cases
Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
Spark NLP and TensorFlow integration and benefits
Training your own domain-specific deep learning NLP models
Best practices for choosing between alternative NLP algorithms and annotators

Advanced Spark NLP functionality that enables scalable open source solutions to more complex language understanding use cases:

Optical character recognition (OCR) annotators and pipelines
Improving OCR accuracy with customized dictionaries, forms, and spell checkers
Entity resolution versus named entity recognition
An overview of state-of-the-art NLP algorithms and models for healthcare

Prerequisite knowledge

Working knowledge of Python is required. Familiarity with the basics of machine learning, deep learning, and Apache Spark is assumed.

Materials or downloads needed in advance

Attendees should bring their own laptop. Instructions on downloading and installing the tutorial environment will be emailed a week before the tutorial.

What you'll learn

Experience building complete NLP pipelines, understanding of the different features and tasks they include, knowledge of how Spark NLP implements these features, and the ability to either reuse pre-trained models or train custom ones.

David Talby

Pacific AI

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.


Alex Thomas

Indeed
Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.


Saif Addin Ellafi

John Snow Labs

Saif Addin Ellafi is a software developer at John Snow Labs, where he is the main contributor to Spark NLP. A data scientist, forever student, and extreme sports and gaming enthusiast, Saif has extensive experience in problem solving and quality assurance in the banking and finance industry.
