Mar 15–18, 2020

Distributed training in the cloud for production-level NLP models

Liqun Shao (Microsoft)
11:50am–12:30pm Wednesday, March 18, 2020
Location: LL20D

Who is this presentation for?

Data scientists or analysts

Level

Intermediate

Description

Researchers have been applying newer deep learning methods to natural language processing (NLP), and data scientists have started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that let them use language models pretrained on large corpora of data. Building SOTA models at production scale can be difficult when you're on a small team and aren't an expert in both NLP and DevOps. Liqun Shao outlines how to build a robust deep learning pipeline that performs distributed deep learning at scale for sentence similarity and question answering.

With the increasing size of datasets, it's becoming difficult to train models on a standalone virtual machine (VM), and scaling out model training helps alleviate this problem. With AzureML Compute as the training platform and Horovod as the distributed training framework, distributed deep learning becomes fast and easy to use. Liqun demonstrates experiments comparing distributed and local training, and she recommends ways to build a robust pipeline that scales better with massive datasets.
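
As a rough illustration, a distributed run of this kind could be submitted with the AzureML SDK's PyTorch estimator, a common pattern around the time of this talk; the cluster name, script name, and node counts below are placeholders, not the session's exact setup:

    # Sketch: submit a Horovod-distributed training job to AzureML Compute.
    # Assumes azureml-sdk is installed and a config.json describes the workspace.
    from azureml.core import Workspace, Experiment
    from azureml.core.runconfig import MpiConfiguration
    from azureml.train.dnn import PyTorch

    ws = Workspace.from_config()  # loads subscription/workspace details

    estimator = PyTorch(
        source_directory="./scripts",
        entry_script="train.py",        # a Horovod-enabled training script
        compute_target="gpu-cluster",   # an existing AzureML Compute cluster
        node_count=2,                   # scale out across two VMs...
        process_count_per_node=4,       # ...with four GPU workers each
        distributed_training=MpiConfiguration(),
        use_gpu=True,
    )

    run = Experiment(ws, "distributed-nlp").submit(estimator)
    run.wait_for_completion(show_output=True)

Inside train.py, the usual Horovod changes apply: call hvd.init(), pin each process to its local GPU rank, wrap the optimizer in hvd.DistributedOptimizer, and broadcast the initial model state from rank 0 so all workers start from identical weights.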

This work is open source and can be accessed at the GitHub repo. The repository contains utility libraries and a set of Python sample notebooks, grouped by scenario and domain, that use SOTA algorithms, including BERT and GenSen, evaluated on popular benchmarks such as SQuAD and SNLI.
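
For a flavor of the sentence-similarity scenario, here is a minimal sketch using pretrained BERT via a recent version of the Hugging Face transformers library; the repo ships its own utility wrappers, so treat this as illustrative rather than the repo's exact API:

    # Sketch: score two sentences with mean-pooled BERT embeddings.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    def embed(sentence: str) -> torch.Tensor:
        """Mean-pool the final hidden states into one sentence vector."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)

    a = embed("A man is playing a guitar.")
    b = embed("Someone plays an instrument.")
    score = torch.nn.functional.cosine_similarity(a, b, dim=0)
    print(f"cosine similarity: {score.item():.3f}")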

Prerequisite knowledge

  • General knowledge of Python
  • A basic understanding of NLP (useful but not required)

What you'll learn

  • Learn ways to train NLP models at scale on Azure

Liqun Shao

Microsoft

Liqun Shao is a data scientist in the AI Development Acceleration Program at Microsoft. She completed her first rotational project, “Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-Based Platforms,” with a paper published at SoCC 2019, and her second, “Azure Machine Learning Text Analytics Best Practices,” with contributions to the public NLP repo. She earned her bachelor’s degree in computer science in China and her doctorate in computer science at the University of Massachusetts. Her research focuses on natural language processing, data mining, and machine learning, especially title generation, summarization, and classification.

