Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Natural language understanding at scale with spaCy and Spark NLP

David Talby (Pacific AI), Claudiu Branzan (G2 Web Services)
13:30–17:00 Tuesday, 22 May 2018
Data science and machine learning
Location: Capital Suite 12
Level: Intermediate

Who is this presentation for?

Practicing data scientists, machine learning engineers, architects and engineering managers

Prerequisite knowledge

Working knowledge of Python, machine learning, and Apache Spark.

Materials or downloads needed in advance

Bring your own laptop so that you'll be able to code and hack with the example notebooks and exercises we'll cover.

What you'll learn

Hands-on experience building NLP pipelines, and the machine learning and deep learning models based on them, in both spaCy and Spark NLP.

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby and Claudiu Branzan lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for training custom, distributed, machine-learned natural language pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. You’ll spend about half your time coding as you work through three sections, each with an end-to-end working codebase that you are then asked to change and improve.

Outline

Using spaCy to build an NLP annotations pipeline that can understand text structure, grammar, and sentiment and perform entity recognition

You’ll cover the built-in spaCy annotators, debugging and visualizing results, creating custom pipelines, and practical trade-offs for large-scale projects, including balancing performance vs. accuracy.

Using TensorFlow to build domain-specific, machine-learned annotators and then integrating them into an existing NLP pipeline

You’ll explore feature engineering, optimization, measurement, and specific practical considerations when working on problems that require understanding text beyond keyword matching and one-hot encoding.

Using Spark ML and TensorFlow to apply deep learning to expand and update ontologies

You’ll compare existing implementations of word2vec and doc2vec, learn when they are useful, and see how they can be applied in practice to increase the accuracy of classification or information retrieval problems. You’ll also examine current trade-offs in integrating spaCy and Spark when engineering distributed, large-scale NLP pipelines.

David Talby

Pacific AI

David Talby is the chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Claudiu Branzan

G2 Web Services

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.
