Advanced natural language processing with Spark NLP
Who is this presentation for?
- Data scientists or analysts
NLP is a key component in many data science systems that must understand or reason about text. David Talby, Alex Thomas, Claudiu Branzan, and Veysel Kocaman use the open source Spark NLP library to explore advanced NLP in Python. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. It’s the most widely used NLP library in the enterprise today.
You’ll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, optical character recognition (OCR), document classification, and multilingual and multidomain support. The discussion of each NLP task covers the latest advances in deep learning used to tackle it, including the prebuilt BERT embeddings in Spark NLP, tuned embeddings, and “post-BERT” research results like XLNet, ERNIE, and RoBERTa.
Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it’s the only open source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and NVIDIA. You’ll run the examples locally on your laptop, and the instructors will also walk through a complete case study and benchmarks showing how to scale an NLP pipeline for both training and inference.
Prerequisite knowledge
- A working knowledge of Python, Jupyter notebooks, and basic machine learning
- Familiarity with deep learning and NLP
Materials or downloads needed in advance
- A laptop with 8 GB or more of memory that can run Docker (you'll receive an email a week before the session with instructions on how to pull and run the local Docker container that contains the library, all dependencies, example notebooks, and datasets)
What you'll learn
- Get hands-on experience with building NLP pipelines, both training and inference, for common NLP tasks, using Spark NLP in Python
David Talby is chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe as part of Microsoft’s Bing Group, and built and ran distributed teams that helped scale Amazon’s financial systems in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.
Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.
Veysel Kocaman is a senior data scientist and ML engineer at John Snow Labs. He has a decade’s experience in the industry and provides hands-on consulting services in ML and AI, statistics, data science, and operations research to several startups and companies around the globe. Previously, Veysel was a CTO, head of AI, and principal data scientist, among other roles. He earned his PhD in computer science at Leiden University (the Netherlands) and an MS degree in operations research from Penn State University.