Sep 23–26, 2019

Natural language understanding at scale with Spark NLP

David Talby (Pacific AI), Alex Thomas (John Snow Labs), Saif Addin Ellafi (John Snow Labs), Claudiu Branzan (Accenture)

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1A 23/24

Data Science, Machine Learning, & AI

Secondary topics: Deep dive into specific tools, platforms, or frameworks; Text and language processing and analysis

Average rating: 3.67 (3 ratings)

Who is this presentation for?

Practicing data scientists

Level

Intermediate

Description

NLP is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Outline:

Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment and perform entity recognition

When is an NLP library needed?
Introduction to Spark NLP
Benchmarks and scalability
Built-in Spark NLP annotators
Core NLP tasks: tokenizer, normalizer, stemmer, lemmatizer, chunker, part-of-speech (POS) tagger, and named entity recognition (NER)
Using pretrained models and pipelines
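
The annotator chain listed above (tokenizer, normalizer, stemmer) can be pictured with a small plain-Python sketch. It does not use Spark NLP's API: the function names, the shared annotation dictionary, and the naive suffix-stripping stemmer are all assumptions chosen for illustration. The point is only that each stage reads earlier annotations and adds its own.

```python
import re
from typing import Callable, Dict, List

# Each toy "annotator" reads earlier annotations from a shared dict and adds
# its own key, loosely mirroring how pipeline stages pass columns along.
Annotations = Dict[str, List[str]]
Annotator = Callable[[Annotations], Annotations]


def tokenizer(doc: Annotations) -> Annotations:
    # Split the raw text into word-like tokens.
    text = doc["text"][0]
    return {**doc, "token": re.findall(r"[A-Za-z0-9']+", text)}


def normalizer(doc: Annotations) -> Annotations:
    # Lowercase tokens and drop anything that is not a letter or digit.
    cleaned = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in doc["token"]]
    return {**doc, "normalized": [t for t in cleaned if t]}


def stemmer(doc: Annotations) -> Annotations:
    # Naive suffix stripping for illustration only (not a real stemming algorithm).
    def stem(word: str) -> str:
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    return {**doc, "stem": [stem(t) for t in doc["normalized"]]}


def run_pipeline(text: str, stages: List[Annotator]) -> Annotations:
    # Apply the stages in order, threading the annotations through each one.
    doc: Annotations = {"text": [text]}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "The patients reported worsening headaches.",
    [tokenizer, normalizer, stemmer],
)
print(result["stem"])  # ['the', 'patient', 'report', 'worsen', 'headache']
```

Because every stage depends only on the keys produced before it, stages can be reordered or swapped as long as their inputs exist, which is the same constraint a real annotator pipeline imposes.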

Building a machine learning pipeline that includes and depends on NLP annotators to generate features

Feature engineering and optimization
Trainable NLP tasks: Spell checker, sentiment analysis, NER
Applying word embeddings to “featurize” text
Best practices and common pitfalls for creating unified NLP and ML pipelines
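
As a rough illustration of what "featurizing" text with word embeddings means, the sketch below averages per-token vectors into one fixed-length feature vector that a downstream ML model could consume. It is plain Python rather than Spark NLP code, and the three-dimensional embedding table is hypothetical, with values made up for the example.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings with made-up values. Real embedding
# models are learned from data and use far more dimensions.
EMBEDDINGS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 2.0],
    "service": [0.0, 2.0, 0.0],
    "slow": [-1.0, 1.0, 1.0],
}
DIM = 3


def featurize(tokens: List[str]) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * DIM
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


features = featurize(["good", "service", "today"])
print(features)  # [0.5, 1.0, 1.0]
```

Averaging is only one simple pooling choice. It discards word order, which is one reason pipelines often combine embeddings with other NLP-derived features.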

Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP

Why you’ll need to train domain-specific NLP models for most real-world use cases
Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
Spark NLP and TensorFlow integration and benefits
Training your own domain-specific deep learning NLP models
Best practices for choosing between alternative NLP algorithms and annotators

Advanced Spark NLP functionality that enables a scalable open source solution to more complex language-understanding use cases

Optical character recognition (OCR) annotators and pipelines
Improving OCR accuracy with customized dictionaries, forms, and spell checkers
Entity resolution versus named entity recognition
An overview of state-of-the-art NLP algorithms and models for healthcare
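
The idea of improving OCR output with a customized dictionary can be sketched using only the Python standard library. This is not Spark NLP's OCR or spell-checker API: the term list, the function name `correct_tokens`, and the 0.8 similarity cutoff are assumptions for illustration. Each token that is not in the dictionary is replaced by its closest dictionary entry when one is similar enough.

```python
from difflib import get_close_matches
from typing import Iterable, List

# Hypothetical domain dictionary; a real one would be much larger.
DOMAIN_TERMS = ["metformin", "insulin", "hypertension", "diabetes"]


def correct_tokens(
    tokens: Iterable[str], vocabulary: Iterable[str], cutoff: float = 0.8
) -> List[str]:
    """Replace tokens with the closest dictionary term when similar enough."""
    vocab = list(vocabulary)
    known = set(vocab)
    corrected: List[str] = []
    for token in tokens:
        lower = token.lower()
        if lower in known:
            corrected.append(lower)
            continue
        # get_close_matches ranks candidates by difflib's similarity ratio.
        matches = get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        # Keep the original token when nothing in the dictionary is close.
        corrected.append(matches[0] if matches else token)
    return corrected


# Typical OCR confusions: "rn" read for "m", "1" read for "l".
ocr_output = ["rnetformin", "insu1in", "daily"]
fixed = correct_tokens(ocr_output, DOMAIN_TERMS)
print(fixed)  # ['metformin', 'insulin', 'daily']
```

A higher cutoff avoids rewriting ordinary words that merely resemble a dictionary term, at the cost of missing heavily garbled tokens.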

Prerequisite knowledge

A working knowledge of Python
Familiarity with the basics of machine learning, deep learning, and Apache Spark

Materials or downloads needed in advance

A laptop with the tutorial environment installed
Complete the setup instructions (to be emailed a week before the conference)

What you'll learn

Gain experience building complete NLP pipelines
Understand the different features and tasks NLP pipelines include, how Spark NLP implements these features, and how to either reuse pretrained models or train custom ones

David Talby

Pacific AI

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe at Microsoft’s Bing Group, and he built and ran distributed teams that helped scale Amazon’s financial systems in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Alex Thomas

John Snow Labs

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Saif Addin Ellafi

John Snow Labs

Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Claudiu Branzan

Accenture

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he draws on more than 10 years of experience in data science, machine learning, and AI to promote the use of these technologies in building smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in the healthcare, high-tech, telecom, and payments verticals.

Comments on this page are now closed.

Comments

David Talby | Chief Technology Officer

09/25/2019 4:37am EDT

Hi everyone,

The slide deck of the tutorial is available here:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Spark%20NLP%20Tutorial.pdf

Thank you for participating in the tutorial yesterday. Please do not hesitate to contact any of us with additional questions.

Best of luck in your NLP projects!

David

Sophia DeMartini | Senior Speaker Manager

09/19/2019 12:14pm EDT

Hi Everyone – just a note – Please set up Docker BEFORE you arrive onsite. We want to protect the internet bandwidth as much as possible, so please prepare as much as you can before you come to the conference. Thank you!

David Talby | Chief Technology Officer

09/19/2019 12:03pm EDT

Hi everyone,

We look forward to meeting you on Tuesday at the tutorial! Please bring your laptop, set up Docker, and follow the instructions under ‘Docker Setup’ on this page in advance:

https://github.com/JohnSnowLabs/spark-nlp-workshop

We created a Docker container which includes all the dependencies, libraries, examples, and data you’ll need during the tutorial.

We’ll explain things from scratch, walk through the code, and you’ll have time to extend it and ask questions.

David

David Talby | Chief Technology Officer

09/19/2019 11:54am EDT

Hi Martin,

This looks like an issue in how Docker is set up on your machine. You should get this error if you pull any docker image or run any docker command – try for example just “docker ps”. Here is a great fix & explanation about this:

https://techoverflow.net/2018/12/15/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket/

David

Martin Lurie | Systems Engineer

09/19/2019 10:27am EDT

[marty@cdsw2 strata]$ docker pull johnsnowlabs/spark-nlp-workshop
Using default tag: latest
Warning: failed to get default registry endpoint from daemon (Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.25/info: dial unix /var/run/docker.sock: connect: permission denied). Using system default: https://index.docker.io/v1/
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.25/images/create?fromImage=johnsnowlabs%2Fspark-nlp-workshop&tag=latest: dial unix /var/run/docker.sock: connect: permission denied