Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning.
Alex Ratner explores data programming, a paradigm for the programmatic creation of training sets. Users express weak supervision strategies or domain heuristics as labeling functions: programs that label subsets of the data but are noisy and may conflict. Alex demonstrates how to denoise the generated training set by explicitly representing the labeling process as a generative model, and how to recover the parameters of these generative models in a handful of settings.
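To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API. The task, the three labeling functions, and the accuracy values are all hypothetical, chosen only for illustration. The sketch also assumes the labeling functions err independently, the two classes are equally likely a priori, and the accuracies are already known. Data programming estimates those accuracies from the votes themselves, without ground truth, and this sketch skips that step.

```python
import math

# Label values: a labeling function may vote for a class or abstain.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

# Hypothetical heuristics for a toy task:
# "does this sentence assert a relation between two things?"
def lf_contains_causes(text):
    return POSITIVE if "causes" in text.lower() else ABSTAIN

def lf_contains_negation(text):
    return NEGATIVE if " not " in f" {text.lower()} " else ABSTAIN

def lf_contains_associated(text):
    return POSITIVE if "associated with" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_negation, lf_contains_associated]

# Assumed accuracies (made up for this sketch), one per labeling function.
ASSUMED_ACCURACIES = [0.9, 0.8, 0.6]

def apply_lfs(text):
    """Collect one vote per labeling function for a single example."""
    return [lf(text) for lf in LABELING_FUNCTIONS]

def prob_positive(votes, accuracies=ASSUMED_ACCURACIES):
    """Posterior P(y = POSITIVE | votes) under an independent-vote model.

    Each non-abstaining vote shifts the log-odds by log(a / (1 - a)),
    so more accurate labeling functions count for more.
    """
    log_odds = sum(v * math.log(a / (1 - a)) for v, a in zip(votes, accuracies))
    return 1 / (1 + math.exp(-log_odds))

if __name__ == "__main__":
    for sentence in [
        "Smoking causes cancer",
        "Gene X does not cause disease Y",
        "Smoking is not associated with X",
    ]:
        votes = apply_lfs(sentence)
        print(sentence, votes, round(prob_positive(votes), 3))
```

The resulting probabilities can serve as soft training labels for a downstream model. When two labeling functions conflict, as in the third sentence, the more accurate one dominates rather than the votes cancelling out.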
Alex then offers an overview of Snorkel, which leverages this new paradigm to make ML systems easier to build, especially for unstructured data extraction. He also discusses real-world applications, along with recent and ongoing work to extend data programming to new modalities and input types and to make it easier for nonexperts to use.
Alex Ratner is a third-year PhD student at the Stanford InfoLab working under Chris Ré. Alex works on new machine learning paradigms for settings where little or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He co-leads the development of the Snorkel framework for lightweight information extraction.
©2017, O'Reilly Media, Inc.