Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Natural language understanding at scale with Spark NLP

Alexander Thomas (John Snow Labs), Claudiu Branzan (Accenture)

13:30–17:00 Tuesday, 30 April 2019

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Deep Learning, Text and Language processing and analysis

Average rating: 4.00 (4 ratings)

Who is this presentation for?

Data scientists and software developers

Level

Intermediate

Prerequisite knowledge

Familiarity with machine learning, Python, and Spark

Materials or downloads needed in advance

A laptop (you'll be provided with a ready-made Docker container that includes the full environment, datasets, and notebooks)

What you'll learn

Learn how to build high-performance, high-accuracy NLP pipelines using Spark NLP, how to integrate them as part of machine learning and deep learning pipelines, and how to scale them in real-world systems

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

Claudiu Branzan and Alex Thomas lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working code base that you can change and improve.

Outline:

Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment and perform entity recognition:

When is an NLP library needed?
Introduction to Spark NLP
Benchmarks and scalability
Built-in Spark NLP annotators
Core NLP tasks: Tokenizer, normalizer, stemmer, lemmatizer, chunker, POS, and NER
Using pretrained models and pipelines
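
As a rough illustration of what chaining annotators means, here is a minimal sketch in plain Python. It is not Spark NLP's API: the names `tokenizer`, `normalizer`, `toy_stemmer`, and `run_pipeline` are hypothetical, and the suffix-stripping stemmer is deliberately crude. The only idea it borrows is that each stage reads earlier annotations and adds a new named column.

```python
import re
from typing import Callable, Dict, List

# A "document" is a dict of named annotation columns. Each stage reads
# existing columns and returns a copy with one more column added.
Doc = Dict[str, List[str]]
Annotator = Callable[[Doc], Doc]


def tokenizer(doc: Doc) -> Doc:
    # Split the raw text into word tokens and single punctuation tokens.
    text = doc["text"][0]
    return {**doc, "token": re.findall(r"\w+|[^\w\s]", text)}


def normalizer(doc: Doc) -> Doc:
    # Lowercase tokens and drop tokens with no alphanumeric characters.
    cleaned = [t.lower() for t in doc["token"] if any(c.isalnum() for c in t)]
    return {**doc, "normalized": cleaned}


def toy_stemmer(doc: Doc) -> Doc:
    # Hypothetical suffix stripper for illustration only; a real stemmer
    # handles far more cases.
    def stem(word: str) -> str:
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    return {**doc, "stem": [stem(t) for t in doc["normalized"]]}


def run_pipeline(text: str, stages: List[Annotator]) -> Doc:
    # Apply the stages in order, threading the growing document through.
    doc: Doc = {"text": [text]}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "Spark NLP scales pipelines, reading texts!",
    [tokenizer, normalizer, toy_stemmer],
)
print(result["stem"])  # ['spark', 'nlp', 'scale', 'pipeline', 'read', 'text']
```

Because every stage keeps the earlier columns, later stages (or a downstream model) can pick whichever level of annotation they need.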

Building a machine learning pipeline that includes and depends on NLP annotators to generate features:

Feature engineering and optimization
Trainable NLP tasks: Spell checker, sentiment analysis, named entity recognition
Applying word embeddings to “featurize” text
Best practices and common pitfalls for creating unified NLP and ML pipelines
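
One common way to "featurize" text with word embeddings is to average the vectors of the tokens in a document. The sketch below shows that idea in plain Python, under stated assumptions: `TOY_EMBEDDINGS` is a hypothetical three-dimensional table with made-up values, and `featurize` is an illustrative helper, not a Spark NLP function.

```python
from typing import Dict, List

# Hypothetical embedding table with made-up values. Real embeddings are
# learned from data and typically have hundreds of dimensions.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 2.0],
    "bad": [-1.0, 0.0, 2.0],
    "service": [0.0, 4.0, 0.0],
}


def featurize(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    # Look up each known token; out-of-vocabulary tokens are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    # Average the vectors component by component.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


features = featurize(["good", "service", "today"], TOY_EMBEDDINGS, dim=3)
print(features)  # [0.5, 2.0, 1.0]
```

The resulting fixed-length vector can be fed to a downstream classifier regardless of how many tokens the document has.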

Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP:

Why you’ll need to train domain-specific NLP models for most real-world use cases
Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
Spark NLP and TensorFlow integration and benefits
Training your own domain-specific deep learning NLP models
Best practices for choosing between alternative NLP algorithms and annotators

Advanced Spark NLP functionality that enables a scalable open source solution to more complex language understanding use cases:

Spell checking and correction
Sentiment analysis and emotion detection
Optical character recognition (OCR) annotators and pipelines
Improving OCR accuracy with customized dictionaries, forms, and spell checkers
Entity resolution versus named entity recognition
An overview of state-of-the-art NLP algorithms and models for healthcare

Alexander Thomas

John Snow Labs

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Claudiu Branzan

Accenture

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.
