Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Natural language understanding at scale with Spark NLP

13:30–17:00 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 14
Secondary topics: Deep Learning, Text and Language processing and analysis

Who is this presentation for?

Data Scientists and Software Developers



Prerequisite knowledge

Familiarity with machine learning, Python, and Spark

Materials or downloads needed in advance

A laptop (you'll be provided with a ready-made Docker container that includes the full environment, datasets, and notebooks)

What you'll learn

Learn how to build advanced high-performance NLP pipelines using Spark NLP, how to integrate them as part of machine learning and deep learning pipelines, and how to scale them in real-world systems


Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.


Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment and perform entity recognition:

When is an NLP library needed?
Introduction to Spark NLP
Benchmarks and scalability
Built-in Spark NLP annotators
Core NLP tasks: Tokenizer, normalizer, stemmer, lemmatizer, chunker, POS, and NER
Using pretrained models and pipelines

Building a machine learning pipeline that includes and depends on NLP annotators to generate features:

Feature engineering and optimization
Trainable NLP tasks: Spell checker, sentiment analysis, named entity recognition
Applying word embeddings to “featurize” text
Best practices and common pitfalls for creating unified NLP and ML pipelines

Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP:

Why you’ll need to train domain-specific NLP models for most real-world use cases
Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
Spark NLP and TensorFlow integration and benefits
Training your own domain-specific deep learning NLP models
Best practices for choosing between alternative NLP algorithms and annotators

Advanced Spark NLP functionality that enables a scalable open-source solution for more complex language understanding use cases:

Spell checking and correction
Sentiment analysis & emotion detection
Optical character recognition (OCR) annotators and pipelines
Improving OCR accuracy with customized dictionaries, forms, and spell checkers
Entity resolution versus named entity recognition
An overview of state-of-the-art NLP algorithms and models for healthcare

Alexander Thomas

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9 and with NLP libraries and frameworks including UIMA and OpenNLP.

Claudiu Branzan

Accenture AI

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.
