Named-entity recognition from scratch with spaCy
Who is this presentation for?
- Data scientists or analysts
A common task in processing unstructured text data is to identify the named subjects being discussed, such as companies (e.g., Apple) and products (e.g., iPhone). This task, named-entity recognition (NER), is typically performed using term inventories and pattern matching. However, the proliferation of neural models for natural language processing (NLP), and of open source libraries that make them accessible, has made advanced methods for identifying named entities a more standard part of NLP pipelines.
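As a point of reference, the inventory-and-pattern approach can be sketched in a few lines of Python. This is a minimal illustration only: the `INVENTORY` dictionary and the `match_inventory` helper are hypothetical names chosen for this example, and the terms are taken from the examples in the paragraph above.

```python
import re

# Hypothetical term inventory mapping surface forms to labels (illustrative only).
INVENTORY = {
    "Apple": "COMPANY",
    "iPhone": "PRODUCT",
}

def match_inventory(text, inventory):
    """Return (start, end, surface, label) for each inventory term found in text."""
    # Try longer terms first so a multi-word term wins over a shorter one inside it.
    terms = sorted(inventory, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), inventory[m.group(1)])
        for m in pattern.finditer(text)
    ]

matches = match_inventory("Apple unveiled a new iPhone.", INVENTORY)
for start, end, surface, label in matches:
    print(start, end, surface, label)
```

A matcher like this only finds terms that are already in the inventory, and it cannot tell Apple the company from other uses of the same word. Those are the gaps a trained model is meant to address.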
Benjamin Batorsky offers an overview of neural NER models, focusing on the stack long short-term memory (LSTM) model implemented in the spaCy NLP library. The model treats NER as a set of transitions between words that are part of an entity and those that are outside an entity based on each word’s contextual information and the model’s internal state. Sets of words that represent entities are output along with their predicted labels (e.g., company or product).
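To make the idea of transitions concrete, here is a minimal sketch, assuming a simplified begin/in/last/unit/out action scheme. It shows only how a sequence of per-word actions, together with a small internal state, yields labeled entity spans. The actions here are written by hand; in a trained model each action would be predicted from the word's context and the current state. The `apply_transitions` function and the label names are hypothetical, not spaCy's API.

```python
def apply_transitions(tokens, actions):
    """Apply one action per token and return (entity_text, label) pairs.

    Actions (a simplified scheme assumed for illustration):
      "O"        token is outside any entity
      "B-LABEL"  token begins an entity
      "I-LABEL"  token continues the open entity
      "L-LABEL"  token ends the open entity
      "U-LABEL"  token is a single-token entity
    """
    entities = []
    open_tokens, open_label = [], None  # internal state: the entity being built
    for token, action in zip(tokens, actions):
        move, _, label = action.partition("-")
        inside = bool(open_tokens)
        # Only some actions are valid in each state.
        if move in ("O", "U", "B") and inside:
            raise ValueError(f"{action!r} is not valid while an entity is open")
        if move in ("I", "L") and (not inside or label != open_label):
            raise ValueError(f"{action!r} does not continue an open entity")
        if move == "U":
            entities.append((token, label))
        elif move == "B":
            open_tokens, open_label = [token], label
        elif move == "I":
            open_tokens.append(token)
        elif move == "L":
            entities.append((" ".join(open_tokens + [token]), label))
            open_tokens, open_label = [], None
    return entities

tokens = ["Apple", "released", "the", "iPhone", "11", "today"]
actions = ["U-COMPANY", "O", "O", "B-PRODUCT", "L-PRODUCT", "O"]
entities = apply_transitions(tokens, actions)
print(entities)
```

The validity checks show why the internal state matters: whether an action is allowed depends on whether an entity is currently open.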
You’ll see how the data team used this architecture in spaCy to train a custom model to identify food products and regulatory agencies in a non-English text corpus. Benjamin creates a training dataset, optimizes and monitors model training, and goes over the results, demonstrating that the model identified both entities held out from the training set and valid entities that hadn’t been part of the original inventory. Benjamin lays out the steps for implementing the model as part of an NLP pipeline and outlines plans for future work in this area.
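The two checks described above, recovering held-out entities and surfacing entities outside the original inventory, could be summarized along these lines. This is a sketch under stated assumptions, not the evaluation used in the talk: `summarize_predictions` is a hypothetical helper, and the terms are invented placeholders, not results.

```python
def summarize_predictions(predicted, training_terms, held_out_terms):
    """Compare predicted entity strings against a split term inventory."""
    predicted = set(predicted)
    held_out = set(held_out_terms)
    inventory = set(training_terms) | held_out
    recovered = predicted & held_out
    return {
        # Share of held-out inventory terms that the model still found.
        "held_out_recall": len(recovered) / len(held_out) if held_out else 0.0,
        "recovered_held_out": sorted(recovered),
        # Predictions absent from the whole inventory: candidates for manual review.
        "outside_inventory": sorted(predicted - inventory),
    }

# Invented placeholder terms for illustration only.
training_terms = {"olive oil", "rye bread"}
held_out_terms = {"goat cheese", "oat milk"}
predicted = {"olive oil", "goat cheese", "almond butter"}

summary = summarize_predictions(predicted, training_terms, held_out_terms)
print(summary)
```

Predictions that fall outside the inventory are not automatically correct. Someone still has to review them to confirm they are valid entities before they count in the model's favor.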
The bulk of your time will be spent on implementation and evaluation, the steps of which will be useful even if you have less technical expertise.
Prerequisite knowledge
- Familiarity with text processing (e.g., cleaning, tokenization)
- General knowledge of how machine learning models are trained and the considerations in evaluation
What you'll learn
- Understand the history and techniques of NER and a specific implementation of NER that can be adapted to your own use cases
Benjamin Batorsky is an associate director of data science at MIT Sloan, where he works to derive insight from a rich dataset on small businesses and their customers. Previously, he was lead data scientist at the small business marketing analytics company ThriveHive, and he worked in the areas of health, policy, and infrastructure. He earned a PhD from the RAND Corporation in policy analysis.