Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow

David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
1:30pm–5:00pm Tuesday, September 26, 2017
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 23/24
Level: Intermediate
Secondary topics: Deep learning, PyData, Text

Who is this presentation for?

  • Data scientists, software engineers, and solution architects

Prerequisite knowledge

  • A working knowledge of Spark and machine learning

Materials or downloads needed in advance

  • A laptop with the course Docker container installed and working (will be provided prior to the tutorial)

What you'll learn

  • Gain hands-on experience using spaCy, TensorFlow, and Spark ML to construct natural language processing pipelines

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. You’ll spend about half your time coding as you work through three sections, each with an end-to-end working codebase that you are then asked to change and improve.

Outline

  • Using spaCy to build an NLP annotation pipeline that can understand text structure, grammar, and sentiment and perform entity recognition: You’ll cover the built-in spaCy annotators, debugging and visualizing results, creating custom pipelines, and practical trade-offs for large-scale projects, including balancing performance versus accuracy.
  • Using TensorFlow to build domain-specific, machine-learned annotators and then integrating them into an existing NLP pipeline: You’ll explore feature engineering, optimization, measurement, and specific practical considerations when working on problems that require understanding text beyond keyword matching and one-hot encoding.
  • Using Spark ML and TensorFlow to apply deep learning to expand and update ontologies: You’ll compare existing implementations of word2vec and doc2vec, learn when they are useful, and see how they can be applied in practice to increase the accuracy of classification or information retrieval problems. You’ll also examine current trade-offs in integrating spaCy and Spark when engineering distributed, large-scale NLP pipelines.
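
To illustrate why the outline contrasts one-hot encoding with word embeddings, here is a minimal sketch in plain Python. It is not taken from the tutorial materials. The vocabulary and the three-dimensional vectors are hypothetical values picked by hand for illustration; they were not trained with word2vec or any other model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical three-word vocabulary (an assumption for this sketch).
vocab = ["doctor", "physician", "banana"]


def one_hot(word):
    """One-hot vector: 1.0 at the word's vocabulary position, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hand-picked toy embeddings, NOT the output of a trained model.
toy_embeddings = {
    "doctor": [0.9, 0.8, 0.1],
    "physician": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.0, 0.95],
}

# Any two distinct one-hot vectors are orthogonal, so related words look unrelated.
one_hot_sim = cosine(one_hot("doctor"), one_hot("physician"))

# Dense vectors can place related words close together and unrelated words apart.
related_sim = cosine(toy_embeddings["doctor"], toy_embeddings["physician"])
unrelated_sim = cosine(toy_embeddings["doctor"], toy_embeddings["banana"])

print(one_hot_sim)  # 0.0
print(related_sim > unrelated_sim)  # True
```

With real embeddings the same comparison would use vectors learned from a corpus, which is where implementations such as word2vec come in.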

David Talby

Pacific AI

David Talby is the chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Claudiu Branzan

G2 Web Services

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Alexander Thomas

Indeed

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Comments

David Talby | CTO
10/17/2017 11:44am EDT

Vivek, the issue is probably that the docker run command’s rm parameter needs to have two preceding dashes:

docker run -it --rm -p 8888:8888 melcutz/nlu-demo

You can also try it without the --rm parameter. Here is the reference for ‘docker run’:
https://docs.docker.com/engine/reference/run/

Hope this helps! Please ask again if this still is an issue.
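
For anyone pasting the command from an email or a web page, a short Python check can reveal whether typographic dashes slipped in. This is only a sketch. The mapping of an em-dash to two hyphens and an en-dash to one hyphen is an assumption based on the commands quoted in this thread, and the function names are hypothetical.

```python
# Heuristic mapping (an assumption based on this thread): an em-dash stands in
# for "--" and an en-dash stands in for "-".
DASH_FIXES = {"\u2014": "--", "\u2013": "-"}


def find_typographic_dashes(command):
    """Return (position, character) pairs for every en- or em-dash found."""
    return [(i, ch) for i, ch in enumerate(command) if ch in DASH_FIXES]


def normalize_dashes(command):
    """Replace en- and em-dashes with the ASCII hyphens they likely stood for."""
    for bad, good in DASH_FIXES.items():
        command = command.replace(bad, good)
    return command


# A command as it might look after an email client or web page reformatted it.
pasted = "docker run \u2013it \u2014rm \u2013p 8888:8888 melcutz/nlu-demo"

print(find_typographic_dashes(pasted))  # positions of the three bad dashes
fixed = normalize_dashes(pasted)
print(fixed)  # docker run -it --rm -p 8888:8888 melcutz/nlu-demo
```

Retyping the options by hand in the terminal avoids the problem without any script.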

Vivek Ranjan | DATA SCIENTIST
10/17/2017 11:18am EDT

I can run the following from the shell:
docker run --rm -it melcutz/nlu-demo bash
and get a shell after executing the above.

However, when I run the following:
docker run –it --rm –p 8888:8888 melcutz/nlu-demo
I get the following error:
docker: invalid reference format.
See 'docker run --help'.

Looking for help to run the demo.

Claudiu Branzan | DIRECTOR OF DATA SCIENCE
09/29/2017 7:12am EDT

If you liked this session, please rate it and leave comments. It helps us adjust the content and tune it to your preferences. Thank you!

David Talby | CTO
09/26/2017 12:11pm EDT

Here is the homepage for the Spark-NLP library:

http://nlp.johnsnowlabs.com

Here is the public GitHub repository for it:

https://github.com/JohnSnowLabs/spark-nlp

Claudiu Branzan | DIRECTOR OF DATA SCIENCE
09/25/2017 3:47pm EDT

We will be available 30 minutes before tomorrow’s session to help you get the Docker container up and running. As David mentioned earlier, differences in text formatting have malformed some commands (the docker run options), which can cause issues, but we’ll get you going in no time. Just make sure you download and install the Docker image before showing up for the tutorial. That’s the most time-consuming piece, and I’m sure the WiFi connection at the conference venue can’t support too many simultaneous downloads of such big files.

David Talby | CTO
09/25/2017 2:46pm EDT

Hi Philipp, thanks for bringing up the point. The ‘rm’ option for ‘docker run’ does need to be preceded by two dashes. The email client seems to have joined them. That’s also the case for Docker’s online docs, where only some fonts show the correct two-dash format:
https://docs.docker.com/engine/reference/run/#clean-up-rm

Philipp Reiner | REGIONAL BUSINESS INTELLIGENCE ANALYST
09/25/2017 1:26pm EDT

When running the docker run command I get a response “docker: invalid reference format”. In your emails, are those options all preceded by two dashes? They all look like em-dashes in your email, so I was wondering if your email client reformatted those. The help for the run command is telling me to specify “--rm”… but in any case, -- does not work either and I get the same error. Thanks for your help!

Claudiu Branzan | DIRECTOR OF DATA SCIENCE
09/23/2017 5:51am EDT

To run the materials for this tutorial, please follow the instructions in Installation Instructions.pdf, which you can find at https://github.com/melcutz/NLU_tutorial

Narendra Prasad Karnatam | BIGDATA SOLUTION ARCHITECT
09/22/2017 9:31am EDT

Hi There,

How do I install the Docker container on my laptop?

Thanks,
Naren

Claudiu Branzan | DIRECTOR OF DATA SCIENCE
09/20/2017 7:35am EDT

I think you need to register for this Tutorial.
Materials (namely the Docker container and instructions to install and run it) will be provided prior to the tutorial. We’re actively working to have it published and ready for you before Monday 9/25.

Greg Hayworth | DATA SCIENTIST
09/20/2017 7:25am EDT

Need: “course Docker container installed and working (will be provided prior to the tutorial)”

Is this something that will happen the day of the conference or will material be provided ahead of time in some way?

Celia Chen | DATA SCIENTIST
09/18/2017 7:01am EDT

Hi there, do we need to register beforehand for the Tuesday tutorial? Thanks.