Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Natural language understanding at scale with Spark NLP

David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
1:30pm–5:00pm Tuesday, 09/11/2018
Data science and machine learning
Location: 1A 21/22 Level: Intermediate
Secondary topics: Text and language processing and analysis
Average rating: 3.00 (7 ratings)

Who is this presentation for?

  • Data scientists, NLP engineers, and AI software architects and leaders

Prerequisite knowledge

  • Familiarity with machine learning, Python, and Spark

Materials or downloads needed in advance

  • A laptop (you'll be provided with a ready-made Docker container that includes the full environment, datasets, and notebooks)

What you'll learn

  • Learn how to build advanced high-performance NLP pipelines using Spark NLP, how to integrate them as part of machine learning and deep learning pipelines, and how to scale them in real-world systems

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Outline:

Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment and perform entity recognition

  • When is an NLP library needed?
  • Introduction to Spark NLP
  • Benchmarks and scalability
  • Built-in Spark NLP annotators
  • Core NLP tasks: Tokenizer, normalizer, stemmer, lemmatizer, chunker, POS, and NER
  • Creating custom pipelines
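The key idea behind the pipelines above is that annotators are chained: each stage consumes the previous stage's output. The real Spark NLP API runs these stages on a Spark session, so as a library-free illustration of that chaining idea, here is a minimal pure-Python sketch (the stage names mirror Spark NLP's Tokenizer, Normalizer, and Stemmer annotators, but the implementations are deliberate simplifications):

```python
import re

def tokenizer(doc):
    # Split raw text into word tokens (simplified; Spark NLP's
    # Tokenizer is rule-based and configurable).
    return re.findall(r"\w+", doc)

def normalizer(tokens):
    # Lowercase and strip non-letter characters from each token.
    return [re.sub(r"[^a-z]", "", t.lower()) for t in tokens]

def stemmer(tokens):
    # Crude suffix-stripping stand-in for a real stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

def run_pipeline(doc, stages):
    # Each stage consumes the previous stage's annotations, just as
    # Spark NLP annotators declare input and output columns.
    result = doc
    for stage in stages:
        result = stage(result)
    return result

print(run_pipeline("Spark NLP chains annotators", [tokenizer, normalizer, stemmer]))
```

In Spark NLP itself, the equivalent chaining is declared by setting each annotator's input and output columns and assembling the stages into a Spark ML Pipeline.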

Using Spark ML, scikit-learn, or TensorFlow to build a domain-specific machine learning pipeline that includes and depends on NLP annotators to generate features

  • Feature engineering and optimization
  • Measurement
  • Trainable NLP tasks: Spell checker, sentiment analysis, named entity recognition
  • Applying word embeddings to “featurize” text
  • Best practices and common pitfalls for creating unified NLP and ML pipelines
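"Featurizing" text with word embeddings, as covered above, means turning variable-length token sequences into fixed-length numeric vectors that a downstream ML model can consume. One common baseline is averaging the per-token vectors. The embeddings below are invented two-dimensional toys purely for illustration; in the tutorial they would come from a pretrained embedding model:

```python
# Tiny invented embedding table (real embeddings have hundreds of
# dimensions and come from a pretrained model).
TOY_EMBEDDINGS = {
    "good":  [0.9, 0.1],
    "bad":   [0.1, 0.9],
    "movie": [0.5, 0.5],
}

def featurize(tokens, embeddings, dim=2):
    # Average the embedding vectors of known tokens to produce one
    # fixed-length feature vector for the whole document.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(featurize(["good", "movie"], TOY_EMBEDDINGS))  # average of the two vectors
```

The resulting vector can be fed to any Spark ML, scikit-learn, or TensorFlow estimator, which is exactly the kind of unified NLP-plus-ML pipeline this section builds.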

Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP

  • Why you’ll need to train domain-specific NLP models for most real-world use cases
  • Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
  • Spark NLP and TensorFlow integration and benefits
  • Training your own domain-specific deep learning NLP models
  • Using pretrained models and pipelines
  • Best practices for choosing between alternative NLP algorithms and annotators
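Deep learning NER models of the kind trained in this section typically emit one BIO tag per token (B-X begins an entity of type X, I-X continues it, O is outside any entity), and a decoding step collapses those tags into entity spans. As a small self-contained sketch of that decoding step (the tokens and tags below are made-up examples):

```python
def decode_bio(tokens, tags):
    # Collapse per-token BIO tags into (entity_text, label) spans.
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["John", "Smith", "visited", "New", "York"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, tags))  # [('John Smith', 'PER'), ('New York', 'LOC')]
```

In Spark NLP this decoding happens inside the pipeline; the sketch just makes explicit what the model's raw per-token output looks like before it becomes entities.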

Advanced Spark NLP functionality that enables a scalable open source solution to more complex language understanding use cases

  • Optical character recognition (OCR) annotators and pipelines
  • Improving OCR accuracy with customized dictionaries, forms, and spell checkers
  • Entity resolution versus named entity recognition
  • Building automated knowledge graphs using entity resolution and curated ontologies
  • An overview of state-of-the-art NLP algorithms and models for healthcare
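The distinction between named entity recognition and entity resolution drawn above is that NER finds *where* an entity is mentioned, while entity resolution maps that mention to a canonical concept in a curated ontology, so different surface forms collapse to one identifier. The tiny "ontology" and concept IDs below are invented for illustration; real systems resolve against curated vocabularies such as medical coding systems:

```python
# Invented mini-ontology: surface forms -> canonical concept IDs.
ONTOLOGY = {
    "heart attack":          "C0027051",
    "myocardial infarction": "C0027051",
    "hypertension":          "C0020538",
    "high blood pressure":   "C0020538",
}

def resolve(mention, ontology):
    # Normalize the surface form (lowercase, collapse whitespace),
    # then look up its canonical concept ID.
    key = " ".join(mention.lower().split())
    return ontology.get(key)  # None when the mention is unknown

print(resolve("Myocardial  Infarction", ONTOLOGY))
print(resolve("heart attack", ONTOLOGY) == resolve("myocardial infarction", ONTOLOGY))
```

Resolved concept IDs are what make it possible to link mentions across documents into the automated knowledge graphs this section discusses.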

David Talby

Pacific AI

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.


Claudiu Branzan

G2 Web Services

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.


Alexander Thomas

Indeed

Alex Thomas is a data scientist at Indeed. He has applied natural language processing (NLP) and machine learning to clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9 and with NLP libraries and frameworks including UIMA and OpenNLP.

Comments on this page are now closed.

Comments

Claudiu Branzan | VICE PRESIDENT, DATA SCIENCE AND ENGINEERING
09/12/2018 6:44am EDT

Installation instructions:

Download and install Docker (Community Edition / Stable channel) following the instructions on:

https://docs.docker.com/install/

After Docker is running on your machine, run the following command to get the Docker container for this tutorial on your machine:

docker pull alnith/strata:v1

Because of all the dependencies (Spark, TensorFlow, etc.) and all the training data, the image file is very large (14.8GB), so it might take some time to download.

Once the command finishes successfully and you have the image on your machine (use ‘docker images’ to validate), use the following command to start the Docker container:

docker run -it --rm -p 8888:8888 alnith/strata:v1

If you don’t get an error message, you’re all set and should be able to find Jupyter running on localhost:8888.