Sep 9–12, 2019

Building and managing training datasets for ML with Snorkel

Alex Ratner (Snorkel)
11:05am–11:45am Thursday, September 12, 2019
Location: 230 C
Average rating: 5.00 (3 ratings)

Who is this presentation for?

  • ML developers, data scientists, and research scientists

One of the key bottlenecks in building ML systems is creating and managing the massive training datasets that today’s models learn from.

Alex Ratner outlines work on Snorkel, an open source framework for building and managing training datasets, and details its three key operators for building and manipulating training datasets: labeling functions for labeling unlabeled data, transformation functions for expressing data augmentation strategies, and slicing functions for partitioning and structuring training datasets. These operators let domain experts specify ML models via noisy, programmatic operations over training data, so applications can be built in hours or days rather than months or years. Alex also explores recent work on modeling the noise and imprecision inherent in these operators and on using these approaches to train ML models that solve real-world problems, including a recent state-of-the-art result on the SuperGLUE natural language processing benchmark.
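The three operator types can be sketched in plain Python as follows. This is an illustrative simplification, not the actual Snorkel API: the function names here are invented, Snorkel wraps such functions in decorators such as `@labeling_function()`, and it combines labeling-function votes by fitting a generative label model that estimates each function's accuracy, rather than the naive majority vote used below.

```python
# Illustrative sketch of Snorkel's three operator types (not the real API).

ABSTAIN = -1  # labeling functions may abstain on examples they can't judge

# Labeling functions: heuristics that vote on a label for unlabeled data.
# Here, class 1 might mean "spam" and class 0 "not spam".
def lf_contains_url(text):
    return 1 if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return 0 if len(text.split()) > 5 else ABSTAIN

# Transformation function: a data augmentation strategy that rewrites an
# example while (ideally) preserving its label.
def tf_lowercase(text):
    return text.lower()

# Slicing function: a predicate that picks out a subset of the data to
# monitor or emphasize during training.
def sf_mentions_price(text):
    return "$" in text

def majority_label(text, lfs):
    """Combine noisy labeling-function votes by majority vote.

    Snorkel instead learns a label model over the votes, which is what
    makes the noise and imprecision of the operators tractable.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

For example, `majority_label("Check http://x.co now", [lf_contains_url, lf_short_message])` yields label 1, since only the URL heuristic fires; on a long plain-text message, only the length heuristic fires and the result is 0.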

Prerequisite knowledge

  • A basic understanding of machine learning

What you'll learn

  • Discover learning techniques for building, managing, and iterating on training datasets and modeling pipelines for ML in general and using the Snorkel framework

Alex Ratner

Alex Ratner is the project lead of Snorkel, a system for programmatically building and managing training datasets for machine learning, and (starting in 2020) an assistant professor of computer science at the University of Washington. Previously, he completed his PhD in computer science at Stanford, advised by Christopher Ré, where his research focused on applying data management and statistical learning techniques to emerging machine learning workflows, such as creating and managing training data, and on applying these techniques to real-world problems in medicine, knowledge base construction, and more. At Stanford, he started and led the Snorkel project, which has been deployed at large technology companies like Google, academic labs, and government agencies, and which was recognized as a VLDB 2018 “Best Of” paper.

