Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Building high-performance text classifiers on a limited labeling budget

Robert Horton (Microsoft), Mario Inchiosa (Microsoft), Ali Zaidi (Microsoft)
11:00am–11:40am Wednesday, March 27, 2019
Average rating: 4.70 (10 ratings)

Who is this presentation for?

  • Data scientists



Prerequisite knowledge

  • Familiarity with supervised machine learning and performance metrics for binary classifiers

What you'll learn

  • Learn how to use transfer learning from complex language models trained on large datasets (often by someone else) to generate features that can be used with simple models capable of learning from small datasets
  • Understand how active learning can help you select examples from a pool of unlabeled cases that will be most effective for training a model
  • Discover how to use the hyperparameter tuning process on a scalable cloud-based architecture


Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Though data is plentiful in many domains, the limiting factor in applying supervised machine learning is often the availability of useful labels. Labels typically represent the interpretation a human applies to an example, and obtaining them can be expensive, particularly in application domains where experts are highly compensated or difficult to find. Active learning is a model-driven selection process that helps make more effective use of a labeling budget.

Robert, Mario, and Ali start by building a model on a small dataset, then use that model to select additional examples to label. Using multiple rounds of modeling and selection, you can obtain training sets that lead to much better-performing models than would be expected from training on a randomly selected dataset of similar size.
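The loop described above can be sketched in a few lines. This is a minimal, self-contained illustration (not the presenters' actual code) using uncertainty sampling on synthetic data: train on the labeled seed set, score the unlabeled pool, and move the examples the model is least sure about into the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a large unlabeled pool with a small labeled seed set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))  # initial seed set
pool = [i for i in range(len(X)) if i not in labeled]       # unlabeled pool

BATCH, ROUNDS = 20, 5
for _ in range(ROUNDS):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Uncertainty sampling: choose pool examples whose predicted
    # probability is closest to 0.5 (the model is least confident there).
    probs = model.predict_proba(X[pool])[:, 1]
    uncertainty = np.abs(probs - 0.5)
    chosen = np.argsort(uncertainty)[:BATCH]
    for idx in sorted(chosen, reverse=True):
        labeled.append(pool.pop(idx))       # "pay" to label the example

final_model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
```

In a real workflow the `labeled.append(...)` step is where the labeling budget is spent: each selected example goes to a human annotator rather than being looked up from known labels.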

Many state-of-the-art results in natural language processing rely on the ability to use complex models with large datasets to learn rich representations useful for multiple tasks. Robert, Mario, and Ali’s examples use transfer learning from a pretrained language model to generate features that can be effectively used by low-complexity classifier models capable of training on relatively small datasets.
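The pattern is a frozen encoder feeding a simple classifier. In the sketch below, a fixed `HashingVectorizer` stands in for the pretrained language model (which would normally produce the rich features); the point is the pipeline shape — no fitting of the featurizer, and a low-complexity model trained on a small labeled set.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in for a pretrained language model: a fixed (never-fitted) featurizer.
# In practice this would be replaced by embeddings from a pretrained model.
encoder = HashingVectorizer(n_features=256, alternate_sign=False)

texts = ["great product, works well", "terrible, broke on day one",
         "love it", "waste of money", "excellent quality", "very poor"]
labels = np.array([1, 0, 1, 0, 1, 0])

features = encoder.transform(texts)       # frozen encoder: transform only
clf = LogisticRegression().fit(features, labels)  # simple model, small data

preds = clf.predict(encoder.transform(["poor quality", "works great"]))
```

Because the encoder is never retrained, all of the task-specific learning happens in the small linear model, which is exactly what makes the approach feasible on limited labeled data.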

As you integrate machine learning and AI into business processes, even small improvements in predictive performance can translate into huge ROI, so hyperparameter tuning is now an inherent part of many ML pipelines. Robert, Mario, and Ali explain how to leverage Spark clusters in platforms such as Azure Databricks to perform hyperparameter tuning, and detail the improvements this tuning produces in your classifier.
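As a single-machine sketch of the tuning step, a grid search over regularization strength looks like the following; on a Spark cluster (e.g., Azure Databricks) the same candidate evaluations would be distributed across workers rather than run serially.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each (hyperparameter, CV fold) pair is an independent job -- the part
# that a Spark cluster would parallelize in the distributed setting.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)

best_params, best_score = grid.best_params_, grid.best_score_
```

The search is embarrassingly parallel, which is why even a modest cluster can explore much larger hyperparameter grids in the same wall-clock time.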


Robert Horton


Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.


Mario Inchiosa


Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.


Ali Zaidi


Ali Zaidi is a PhD student in statistics at UC Berkeley. Previously, he was a data scientist in Microsoft’s AI and Research Group, where he worked to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Before that, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.