Mar 15–18, 2020

Named-entity recognition from scratch with spaCy

Benjamin Batorsky (MIT Sloan)
11:50am–12:30pm Wednesday, March 18, 2020
Location: Expo Hall

Who is this presentation for?

Data scientists or analysts




A common task in processing unstructured text data is to identify the named subjects being discussed, such as companies (e.g., Apple) and products (e.g., iPhone). This task, named-entity recognition (NER), has typically been performed using term inventories and pattern matching. However, the proliferation of neural models for natural language processing (NLP), and of open source libraries that make them accessible, has made advanced methods for identifying named entities a more standard part of NLP pipelines.
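As a point of reference, the inventory-and-pattern approach can be sketched in a few lines of standard-library Python. This is a minimal illustration, not code from the talk: the `TERM_INVENTORY` dictionary, its labels, and the function names are all hypothetical.

```python
import re

# Hypothetical term inventory mapping surface forms to labels (illustrative only).
TERM_INVENTORY = {
    "Apple": "COMPANY",
    "iPhone": "PRODUCT",
}


def build_pattern(terms):
    """Compile one regex that matches any inventory term as a whole word."""
    # Try longer terms first so a long name wins over a shorter prefix of it.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")


def match_entities(text, inventory=TERM_INVENTORY):
    """Return (surface form, label, start, end) for every inventory hit in text."""
    pattern = build_pattern(inventory)
    return [
        (m.group(0), inventory[m.group(0)], m.start(), m.end())
        for m in pattern.finditer(text)
    ]


example_text = "Apple unveiled a new iPhone, and Samsung responded."
example = match_entities(example_text)
```

The sketch also shows the limits of the approach: "Samsung" is missed because it is not in the inventory, and "Apple" would be labeled a company in any context, including a sentence about fruit.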

Benjamin Batorsky offers an overview of neural NER models, focusing on the stack long short-term memory (LSTM) model implemented in the spaCy NLP library. The model treats NER as a set of transitions between words that are part of an entity and those that are outside an entity based on each word’s contextual information and the model’s internal state. Sets of words that represent entities are output along with their predicted labels (e.g., company or product).
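The transition view of NER can be made concrete with a small sketch. It assumes the three-action scheme commonly described for stack LSTM taggers (`SHIFT` moves a word into the entity being built, `REDUCE-<label>` closes that entity with a label, `OUT` skips a word outside any entity); spaCy's own action inventory may differ, and the action names here are illustrative. In a trained model each action is predicted from the word's context and the model's internal state, whereas the action sequence below is written by hand.

```python
def apply_transitions(words, actions):
    """Replay a transition sequence over words; return (entity text, label) pairs."""
    buffer = list(words)  # words not yet processed
    stack = []            # words of the entity currently being built
    entities = []
    for action in actions:
        if action == "SHIFT":
            # The next word is part of an entity.
            stack.append(buffer.pop(0))
        elif action == "OUT":
            # The next word is outside any entity; only valid between entities.
            if stack:
                raise ValueError("OUT is not allowed while an entity is open")
            buffer.pop(0)
        elif action.startswith("REDUCE-"):
            # Close the open entity and attach its predicted label.
            if not stack:
                raise ValueError("REDUCE requires an open entity")
            label = action.split("-", 1)[1]
            entities.append((" ".join(stack), label))
            stack.clear()
        else:
            raise ValueError(f"unknown action: {action}")
    return entities


words = ["Apple", "released", "the", "iPhone", "11"]
# Hand-written actions standing in for a trained model's predictions.
actions = ["SHIFT", "REDUCE-COMPANY", "OUT", "OUT", "SHIFT", "SHIFT", "REDUCE-PRODUCT"]
entities = apply_transitions(words, actions)
```

Because entities are built word by word on the stack, multiword names such as "iPhone 11" come out as a single labeled span.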

You’ll see how the data team used this architecture in spaCy to train a custom model to identify food products and regulatory agencies in a non-English text corpus. Benjamin explains how to create a training dataset and how to optimize and monitor model training. He then goes over the results, demonstrating that the model identified both entities held out from the training set and valid entities that hadn’t been part of the original inventory. He closes with the steps for implementing the model as part of an NLP pipeline and plans for future work in this area.
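One way to organize the evaluation described above is to sort the model's predictions into three groups: entities already in the training inventory, held-out entities the model recovered, and candidates that appear in neither list. The sketch below assumes entities are compared as exact strings; the function name and the example data are hypothetical and are not results from the talk.

```python
def evaluate_entities(predicted, held_out, training_inventory):
    """Summarize how predictions relate to held-out and training entities."""
    predicted = set(predicted)
    held_out = set(held_out)
    training_inventory = set(training_inventory)

    # Held-out entities the model found without having seen them in training.
    recovered = predicted & held_out
    # Predictions that were in neither list; these need manual review
    # before they can be counted as valid new entities.
    novel = predicted - held_out - training_inventory

    recall = len(recovered) / len(held_out) if held_out else 0.0
    return {
        "recovered": sorted(recovered),
        "held_out_recall": recall,
        "novel_candidates": sorted(novel),
    }


# Made-up example data for illustration only.
report = evaluate_entities(
    predicted=["olive oil", "yogurt", "oat milk"],
    held_out=["yogurt", "honey"],
    training_inventory=["olive oil", "rice"],
)
```

Held-out recall measures generalization beyond the training inventory, while the novel candidates are only leads: each one still has to be checked by a person to confirm it is a real entity rather than a false positive.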

The bulk of your time will be spent on implementation and evaluation, the steps of which will be useful even if you have less technical expertise.

Prerequisite knowledge

  • Familiarity with text processing (e.g., cleaning, tokenization)
  • General knowledge of how machine learning models are trained and the considerations in evaluation

What you'll learn

  • Understand the history and techniques of NER and a specific implementation of NER that can be adapted to your own use cases

Benjamin Batorsky

MIT Sloan

Benjamin Batorsky is an associate director of data science at MIT Sloan, where he works to derive insight from a rich dataset on small businesses and their customers. Previously, he was lead data scientist at the small business marketing analytics company ThriveHive, and he worked in the areas of health, policy, and infrastructure. He earned a PhD from the RAND Corporation in policy analysis.
