Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)
11:15–11:55 Wednesday, 1 May 2019
Data Science, Machine Learning & AI
Location: Capital Suite 14
Average rating: 4.67 (3 ratings)

Who is this presentation for?

  • Data scientists and software developers who work with text



Prerequisite knowledge

  • Familiarity with Apache Spark

What you'll learn

  • How to use Spark NLP to process text and standardize text fields


More people find jobs on Indeed than anywhere else. With two hundred million unique visitors a month, Indeed has accumulated hundreds of millions of jobs and résumés and trillions of data points of activity. Much of this data is entered by users. Because users express the same or similar facts in different ways, Indeed needs to standardize these fields. The traditional solution is to use a human-curated list of replacement rules. But with datasets as large and diverse as Indeed’s, the better solution is to use the data to normalize itself.

Spark NLP, John Snow Labs’ NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language processing with high performance, accuracy, and scalability. Its algorithms include rule-based, machine learning, and deep learning models, and it provides advanced NLP functionality such as named-entity recognition, fact extraction, spell checking, sentiment analysis, and assertion status detection. These algorithms are combined in NLP pipelines to automate the multiple steps needed to normalize natural language text, from spelling correction to stemming to using corpus statistics to identify preferred forms.
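
The final step named above, using corpus statistics to identify preferred forms, can be illustrated with a minimal sketch in plain Python. This is not Spark NLP code and is not claimed to be the speakers' method: the function names (normalization_key, preferred_forms, standardize) and the sample job titles are hypothetical, and the spelling correction and stemming steps are left out. The idea shown is only that variants which collapse to the same cleaned-up key are grouped, and the most frequent surface form in the data becomes the standard one.

```python
import re
from collections import Counter, defaultdict


def normalization_key(text: str) -> str:
    """Collapse case, punctuation, and whitespace so that variants share one key."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)  # drop punctuation, keep letters and digits
    return " ".join(no_punct.split())           # squeeze runs of whitespace


def preferred_forms(values):
    """Map each key to the surface form that occurs most often in the data."""
    groups = defaultdict(Counter)
    for value in values:
        groups[normalization_key(value)][value.strip()] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in groups.items()}


def standardize(value: str, preferred: dict) -> str:
    """Replace a value with its preferred form; leave unseen values unchanged."""
    return preferred.get(normalization_key(value), value)


# Hypothetical user-entered job titles (illustrative data only).
titles = [
    "Registered Nurse",
    "registered nurse",
    "Registered Nurse",
    "REGISTERED NURSE",
    "Software Engineer",
    "software engineer.",
    "Software Engineer",
]

preferred = preferred_forms(titles)
print(standardize("  registered   NURSE ", preferred))
print(standardize("Data Scientist", preferred))
```

No hand-written replacement rules are involved: the mapping comes entirely from the frequencies in the data. On a real corpus the same grouping would run as a distributed aggregation, and the key function would be replaced by a fuller pipeline.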

Alexis Yelton and Alex Thomas explain how to combine Spark NLP with Apache Spark’s built-in algorithms to create standardized semistructured text directly from résumés and job descriptions. These standardized strings can then be used to improve résumé or job search engines or to feed into machine learning models used for everything from predicting apply rates to recommending jobs to job seekers. Join in to explore the technical challenges, the algorithms, and how you can use them in your next text-processing project.

Alexander Thomas

John Snow Labs

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Alexis Yelton

Indeed


Alexis Yelton is a data scientist at Indeed focusing on building machine learning models for software products. She’s been working with Spark since version 1.6 and has recently moved into the NLP space. She holds a PhD in bioinformatics and did postdoctoral work building models to predict gene function and explain ecosystem function.



Alexis Yelton | DATA SCIENTIST
6/05/2019 16:09 BST

I have posted the slides on LinkedIn:

3/05/2019 11:47 BST

Is it possible to get the slides of your presentation?