Natural language understanding at scale with Spark NLP
Who is this presentation for?
- Practicing data scientists
NLP is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, entity recognition, sentiment analysis, dependency parsing, de-identification, and natural language BI. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.
David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
Using Spark NLP to build an NLP pipeline that can understand text structure, grammar, and sentiment and perform entity recognition
- When is an NLP library needed?
- Introduction to Spark NLP
- Benchmarks and scalability
- Built-in Spark NLP annotators
- Core NLP tasks: tokenization, normalization, stemming, lemmatization, chunking, part-of-speech (POS) tagging, and named entity recognition (NER)
- Using pretrained models and pipelines
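To make the idea of chained annotators concrete, here is a minimal sketch in plain Python. It is not the Spark NLP API: the function names (`tokenizer`, `normalizer`, `stemmer`, `run_pipeline`), the dictionary-based document, and the toy suffix-stripping rule are all assumptions made for illustration. It only shows the general pattern in which each stage reads earlier annotations and adds its own.

```python
import re
from typing import Callable, Dict, List

# A "document" here is just a dict of named annotation lists (illustrative only).
Document = Dict[str, List[str]]
Annotator = Callable[[Document], Document]


def tokenizer(doc: Document) -> Document:
    # Split into alphanumeric runs and single punctuation characters.
    doc["token"] = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", doc["text"][0])
    return doc


def normalizer(doc: Document) -> Document:
    # Lowercase tokens and drop anything that is not alphanumeric.
    doc["normalized"] = [t.lower() for t in doc["token"] if t.isalnum()]
    return doc


def stemmer(doc: Document) -> Document:
    # Toy suffix stripping, far simpler than a real stemmer.
    def stem(word: str) -> str:
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    doc["stem"] = [stem(w) for w in doc["normalized"]]
    return doc


def run_pipeline(text: str, stages: List[Annotator]) -> Document:
    # Each stage receives the document produced by the previous stage.
    doc: Document = {"text": [text]}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "The patients reported headaches.", [tokenizer, normalizer, stemmer]
)
print(result["stem"])
```

Because every stage only depends on annotations produced earlier, stages can be reordered or swapped as long as their inputs are still available.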
Building a machine learning pipeline that includes and depends on NLP annotators to generate features
- Feature engineering and optimization
- Trainable NLP tasks: spell checking, sentiment analysis, and NER
- Applying word embeddings to “featurize” text
- Best practices and common pitfalls for creating unified NLP and ML pipelines
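One common way to "featurize" text with word embeddings is to average the vectors of the tokens in a document. The sketch below assumes a tiny hand-written embedding table (`EMBEDDINGS`, with made-up three-dimensional values) and a hypothetical helper named `featurize`; real embeddings would be loaded from a trained model, and this is not how Spark NLP necessarily does it.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings with made-up values (illustration only).
EMBEDDINGS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 0.5],
    "great": [0.8, 0.2, 0.4],
    "bad": [-1.0, 0.0, 0.5],
}


def featurize(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    # Keep only tokens that have an embedding; unknown words are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    # Average each dimension across the known tokens.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


features = featurize(["good", "movie", "bad"], EMBEDDINGS, dim=3)
print(features)
```

The resulting fixed-length vector can be fed to any downstream classifier, which is what makes a unified NLP and ML pipeline possible. Averaging discards word order, so it is a baseline rather than a best practice.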
Using Spark NLP with TensorFlow to train deep learning models for state-of-the-art NLP
- Why you’ll need to train domain-specific NLP models for most real-world use cases
- Recent deep learning research results for named entity recognition, entity resolution, assertion status detection, and de-identification
- Spark NLP and TensorFlow integration and benefits
- Training your own domain-specific deep learning NLP models
- Best practices for choosing between alternative NLP algorithms and annotators
Advanced Spark NLP functionality that enables a scalable open source solution to more complex language-understanding use cases
- Optical character recognition (OCR) annotators and pipelines
- Improving OCR accuracy with customized dictionaries, forms, and spell checkers
- Entity resolution versus named entity recognition
- An overview of state-of-the-art NLP algorithms and models for healthcare
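The difference between named entity recognition and entity resolution can be sketched in a few lines: recognition finds mention spans in text, while resolution maps each mention to a canonical identifier. The lexicon, the `CONCEPT-…` identifiers, and the function names `recognize` and `resolve` below are hypothetical and exist only for illustration; a production system would use trained models and a real terminology rather than exact string lookup.

```python
import re
from typing import List, Tuple

# Hypothetical lexicon: surface form -> made-up canonical concept id.
LEXICON = {
    "heart attack": "CONCEPT-001",
    "myocardial infarction": "CONCEPT-001",
    "high blood pressure": "CONCEPT-002",
    "hypertension": "CONCEPT-002",
}


def recognize(text: str) -> List[Tuple[int, int, str]]:
    """Recognition step: return (start, end, mention) for each lexicon phrase."""
    lowered = text.lower()  # same length as text for ASCII input
    mentions = []
    for phrase in LEXICON:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, lowered):
            mentions.append(
                (match.start(), match.end(), text[match.start():match.end()])
            )
    return sorted(mentions)


def resolve(mention: str) -> str:
    """Resolution step: map a recognized mention to its canonical id."""
    return LEXICON[mention.lower()]


text = "History of hypertension and a prior Myocardial Infarction."
mentions = recognize(text)
resolved = [(mention, resolve(mention)) for _, _, mention in mentions]
print(resolved)
```

Note that two different surface forms ("heart attack" and "myocardial infarction") resolve to the same identifier, which recognition alone would not reveal.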
Prerequisite knowledge
- A working knowledge of Python
- Familiarity with the basics of machine learning, deep learning, and Apache Spark
Materials or downloads needed in advance
- A laptop with the tutorial environment installed
- Complete the setup instructions (to be emailed a week before the conference)
What you'll learn
- Gain experience building complete NLP pipelines
- Understand the different features and tasks NLP pipelines include, how Spark NLP implements these features, and how to either reuse pretrained models or train custom ones
David Talby is chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe at Microsoft’s Bing Group and built and ran distributed teams that helped scale Amazon’s financial systems in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.
Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.
Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies utilizing big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.