Sep 9–12, 2019

Building and managing training datasets for ML with Snorkel

Alex Ratner (Snorkel)
11:05am–11:45am Thursday, September 12, 2019
Location: 230 C
Average rating: 5.00 (3 ratings)

Who is this presentation for?

  • ML developers, data scientists, and research scientists

One of the key bottlenecks in building ML systems is creating and managing the massive training datasets that today’s models learn from.

Alex Ratner outlines work on Snorkel, an open source framework for building and managing training datasets, and details its three key operators for building and manipulating training datasets: labeling functions for labeling unlabeled data, transformation functions for expressing data augmentation strategies, and slicing functions for partitioning and structuring training datasets. These operators let domain experts specify ML models via noisy, programmatic operations over training data, so applications can be built in hours or days rather than months or years. Alex also explores recent work on modeling the noise and imprecision inherent in these operators and on using these approaches to train ML models that solve real-world problems, including a recent state-of-the-art result on the SuperGLUE natural language processing benchmark.
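The three operator types can be sketched in plain Python as follows. This is an illustrative simplification, not the actual Snorkel API: the function names here are invented, Snorkel wraps such functions in decorators such as `@labeling_function()`, and it combines labeling-function votes by fitting a generative label model that estimates each function's accuracy, rather than the naive majority vote used below.

```python
# Illustrative sketch of Snorkel's three operator types (not the real API).

ABSTAIN = -1  # labeling functions may abstain on examples they can't judge

# Labeling functions: heuristics that vote on a label for unlabeled data.
# Here, class 1 might mean "spam" and class 0 "not spam".
def lf_contains_url(text):
    return 1 if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return 0 if len(text.split()) > 5 else ABSTAIN

# Transformation function: a data augmentation strategy that rewrites an
# example while (ideally) preserving its label.
def tf_lowercase(text):
    return text.lower()

# Slicing function: a predicate that picks out a subset of the data to
# monitor or emphasize during training.
def sf_mentions_price(text):
    return "$" in text

def majority_label(text, lfs):
    """Combine noisy labeling-function votes by majority vote.

    Snorkel instead learns a label model over the votes, which is what
    makes the noise and imprecision of the operators tractable.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

For example, `majority_label("Check http://x.co now", [lf_contains_url, lf_short_message])` yields label 1, since only the URL heuristic fires; on a long plain-text message, only the length heuristic fires and the result is 0.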

Prerequisite knowledge

  • A basic understanding of machine learning

What you'll learn

  • Discover learning techniques for building, managing, and iterating on training datasets and modeling pipelines for ML in general and using the Snorkel framework

Alex Ratner

Alex Ratner is the project lead of Snorkel, a system for programmatically building and managing training datasets for machine learning, and (starting in 2020) an assistant professor of computer science at the University of Washington. Previously, he completed his PhD in computer science at Stanford, advised by Christopher Ré, where his research focused on applying data management and statistical learning techniques to emerging machine learning workflows, such as creating and managing training data, and on applying these techniques to real-world problems in medicine, knowledge base construction, and more. At Stanford, he started and led the Snorkel project, which has been deployed at large technology companies like Google, academic labs, and government agencies, and which was recognized as a VLDB 2018 “Best Of” paper.

