Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Natural language understanding at scale with spaCy and Spark NLP

David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

1:30pm–5:00pm Tuesday, March 6, 2018

Data science and machine learning
Location: LL20 C


Who is this presentation for?

Data scientists

Prerequisite knowledge

A working knowledge of Python, Spark, and machine learning

Materials or downloads needed in advance

A laptop with the course Docker container, libraries, and notebooks downloaded.
To run the materials for this tutorial, follow the steps in "Installation Instructions.pdf", available at https://github.com/melcutz/NLU_tutorial
IMPORTANT: These instructions involve downloading and running a large Docker container on your laptop, so we advise you to do this well in advance of the event.

What you'll learn

Gain hands-on experience with common NLP tasks and pipelines using spaCy and Spark NLP

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for training distributed custom natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. You’ll spend about half your time coding as you work through three sections, each with an end-to-end working codebase that you are then asked to change and improve.

Outline

Using spaCy to build an NLP annotations pipeline that can understand text structure, grammar, and sentiment and perform entity recognition

Built-in spaCy annotators
Debugging and visualizing results
Creating custom pipelines
Practical trade-offs for large-scale projects, as well as for balancing performance and accuracy
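
As a rough illustration of the "creating custom pipelines" item above, the following is a minimal sketch and not taken from the tutorial materials. It assumes spaCy 3.x (the tutorial itself predates that release and its API differed), uses a blank English pipeline so no trained model needs to be downloaded, and introduces a hypothetical component name `alpha_counter` and a hypothetical extension attribute `n_alpha` purely for demonstration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute for this sketch; force=True makes re-running safe.
Doc.set_extension("n_alpha", default=0, force=True)


@Language.component("alpha_counter")  # hypothetical component name
def alpha_counter(doc):
    """Count purely alphabetic tokens and store the count on the Doc."""
    doc._.n_alpha = sum(1 for token in doc if token.is_alpha)
    return doc


# A blank pipeline contains only the tokenizer (assumption: spaCy 3.x API).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # built-in rule-based sentence splitter
nlp.add_pipe("alpha_counter")  # custom component runs after the sentencizer

doc = nlp("Spark scales out. spaCy annotates text.")
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names)
print(sentences)
print(doc._.n_alpha)
```

Components run in the order listed in `nlp.pipe_names`, so a custom step can rely on annotations produced by the steps before it. Swapping the blank pipeline for a trained one would add tagging, parsing, and entity recognition, at a cost in speed.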

Using TensorFlow to build domain-specific machine-learned annotators and then integrating them into an existing NLP pipeline

Feature engineering and optimization
Measurement
Practical considerations when working on problems that require understanding text beyond keyword matching and one-hot encoding
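
To make the limitation of one-hot encoding mentioned above concrete, here is a small sketch using only the Python standard library. It is not part of the tutorial materials. The three-word vocabulary and the two-dimensional "dense" vectors are hand-picked toy values chosen for illustration, not trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vocabulary (assumption for this sketch).
vocab = ["physician", "doctor", "invoice"]


def one_hot(word):
    """One-hot vector: 1.0 at the word's position in vocab, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hand-picked toy vectors, NOT trained embeddings.
dense = {
    "physician": [0.9, 0.1],
    "doctor": [0.8, 0.2],
    "invoice": [0.1, 0.9],
}

one_hot_sim = cosine(one_hot("physician"), one_hot("doctor"))
dense_sim_related = cosine(dense["physician"], dense["doctor"])
dense_sim_unrelated = cosine(dense["physician"], dense["invoice"])

print(one_hot_sim)
print(dense_sim_related > dense_sim_unrelated)
```

Any two distinct one-hot vectors are orthogonal, so the encoding cannot express that "physician" and "doctor" are related. Dense representations can place related words close together, which is the motivation for the embedding techniques covered in the next section of the outline.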

Using Spark ML and TensorFlow to apply deep learning to expand and update ontologies

Comparison of word2vec and doc2vec
When each is useful
How to apply them to increase the accuracy of classification or information retrieval problems
Current trade-offs in integrating spaCy and Spark when engineering distributed, large-scale NLP pipelines

David Talby

Pacific AI

David Talby is chief technology officer at Pacific AI, where he helps fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group, and built and ran distributed teams at Amazon, in both Seattle and the UK, that helped scale the company’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Claudiu Branzan

Accenture

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he draws on more than 10 years of experience in data science, machine learning, and AI to promote these technologies and use them to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in the healthcare, high-tech, telecom, and payments verticals.

Alex Thomas

John Snow Labs

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Comments

Sridhar Alla | COFOUNDER AND CTO

03/06/2018 6:34am PST

I set up an AWS instance for the lab.

18.219.0.106:8888/?token=1e119ad4ea1671e1bde4a489ac9b6d44ec0cf9fa4892c524
