Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Data programming: Creating large training sets quickly

Alex Ratner (Stanford University)
Artificial Intelligence, Machine Learning & Data Science
Location: 1A 06/07 Level: Intermediate
Secondary topics:  Hardcore Data Science
Average rating: 4.40 (5 ratings)

What you'll learn

Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning.

Alex Ratner explores data programming, a paradigm for the programmatic creation of training sets. In this paradigm, users express weak supervision strategies or domain heuristics as labeling functions: programs that label subsets of the data but are noisy and may conflict with one another. Alex demonstrates how to denoise the resulting training set by explicitly representing the labeling process as a generative model, and how to recover the parameters of these generative models in a handful of settings.
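
The core idea above can be illustrated with a small, self-contained sketch. The documents, heuristics, and labels below are invented for illustration, and the simple majority vote stands in for the learned generative model that data programming actually uses to weight and denoise the labeling functions:

```python
import re

# Toy task: label sentences as describing an adverse drug effect (+1),
# not describing one (-1), or abstain (0). All examples are hypothetical.
docs = [
    "Aspirin causes gastric bleeding in some patients.",
    "The patient enjoyed a walk in the park.",
    "Ibuprofen is not associated with liver damage.",
]

# Labeling functions: noisy domain heuristics, each labeling only a
# subset of the data and free to conflict with the others.
def lf_causes(doc):
    return 1 if "causes" in doc else 0

def lf_negation(doc):
    return -1 if re.search(r"\bnot\b", doc) else 0

def lf_drug_mention(doc):
    drugs = {"aspirin", "ibuprofen"}
    return 1 if any(d in doc.lower() for d in drugs) else 0

lfs = [lf_causes, lf_negation, lf_drug_mention]

def majority_vote(doc):
    # Naive stand-in for the generative model: unweighted vote.
    total = sum(lf(doc) for lf in lfs)
    return 1 if total > 0 else (-1 if total < 0 else 0)

labels = [majority_vote(d) for d in docs]
# First doc: two positive votes -> +1. Second: all abstain -> 0.
# Third: lf_negation and lf_drug_mention conflict and cancel -> 0.
```

The third document shows why denoising matters: with equal votes, conflicting heuristics simply cancel, whereas the generative model in data programming estimates each function's accuracy and correlations to resolve such conflicts probabilistically.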

Alex then offers an overview of Snorkel, which leverages this new paradigm to make ML systems easier to build, especially for unstructured data extraction, and discusses various real-world applications and recent and ongoing work to extend data programming to new modalities and input types and make it easier for nonexperts to use.
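
To give a flavor of the denoising step, here is a minimal sketch of how votes from labeling functions can be combined into probabilistic labels once per-function accuracies are known. In Snorkel these accuracies are learned from the unlabeled data by fitting the generative model; the fixed values below are assumptions for illustration only:

```python
import math

# Assumed accuracies for three labeling functions (in practice these
# are learned from the label matrix without ground-truth labels).
accuracies = [0.9, 0.8, 0.6]

def soft_label(votes, accs):
    """Combine LF votes (+1, -1, or 0 to abstain) into P(y = +1).

    Each non-abstaining vote is weighted by the log-odds of its
    function's accuracy, so more accurate functions count more.
    """
    score = sum(v * math.log(a / (1 - a))
                for v, a in zip(votes, accs) if v != 0)
    return 1.0 / (1.0 + math.exp(-score))

# Two positive votes from the more accurate functions yield a
# confident probabilistic label; all-abstain yields 0.5 (no signal).
p_positive = soft_label([1, 0, 1], accuracies)
p_no_signal = soft_label([0, 0, 0], accuracies)
```

These probabilistic labels, rather than hard majority votes, are what get passed downstream to train the end discriminative model, which is what makes the pipeline robust to individual noisy heuristics.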

Alex Ratner

Stanford University

Alex Ratner is a third-year PhD student at the Stanford InfoLab, working under Chris Ré. Alex works on new machine learning paradigms for settings where little or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He co-leads the development of the Snorkel framework for lightweight information extraction.