Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Building high-performance text classifiers on a limited labeling budget

Robert Horton (Microsoft), Mario Inchiosa (Microsoft), Ali Zaidi (Microsoft)
11:00am-11:40am Wednesday, March 27, 2019
Average rating: 4.70 (10 ratings)

Who is this presentation for?

  • Data scientists



Prerequisite knowledge

  • Familiarity with supervised machine learning and performance metrics for binary classifiers

What you'll learn

  • Learn how to use transfer learning from complex language models trained on large datasets (often by someone else) to generate features that can be used with simple models capable of learning from small datasets
  • Understand how active learning can help you select examples from a pool of unlabeled cases that will be most effective for training a model
  • Discover how to run hyperparameter tuning on a scalable cloud-based architecture


Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Though plentiful data is available in many domains, often the limiting factor in applying supervised machine learning techniques is the availability of useful labels. Labels often represent the interpretation that a human applies to an example, and obtaining such labels can be expensive, particularly in application domains where experts are highly compensated or difficult to find. Active learning is a model-driven selection process that helps to make more effective use of a labeling budget.

Robert, Mario, and Ali start by building a model on a small dataset, then use that model to select additional examples to label. Using multiple rounds of modeling and selection, you can obtain training sets that lead to much better-performing models than would be expected from training on a randomly selected dataset of similar size.
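
The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the speakers' implementation: it assumes uncertainty sampling (choosing the examples whose predicted probability is closest to 0.5) as the selection rule, which the abstract does not specify, and it uses a tiny hand-written logistic regression on synthetic two-feature data. The names `train_logreg`, `most_uncertain`, `active_learning`, and `ask_label` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a logistic regression with plain batch gradient descent."""
    model = ([0.0] * len(X[0]), 0.0)
    n = len(X)
    for _ in range(epochs):
        weights, bias = model
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict_proba(model, x) - label
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        model = ([w - lr * g / n for w, g in zip(weights, grad_w)],
                 bias - lr * grad_b / n)
    return model


def most_uncertain(model, X, candidates, k):
    """Pick the k candidate indices whose predicted probability is closest to 0.5."""
    return sorted(candidates, key=lambda i: abs(predict_proba(model, X[i]) - 0.5))[:k]


def active_learning(X, ask_label, seed_indices, rounds, batch_size):
    """Alternate between fitting a model and requesting labels for uncertain cases."""
    labels = {i: ask_label(i) for i in seed_indices}
    for _ in range(rounds):
        idx = sorted(labels)
        model = train_logreg([X[i] for i in idx], [labels[i] for i in idx])
        candidates = [i for i in range(len(X)) if i not in labels]
        for i in most_uncertain(model, X, candidates, batch_size):
            labels[i] = ask_label(i)  # each call spends one unit of labeling budget
    idx = sorted(labels)
    return train_logreg([X[i] for i in idx], [labels[i] for i in idx]), labels


# Synthetic stand-in for a pool of featurized documents (not data from the talk).
rng = random.Random(0)
pool = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
truth = [1 if x[0] + x[1] > 0 else 0 for x in pool]

model, labels = active_learning(pool, lambda i: truth[i], seed_indices=range(10),
                                rounds=5, batch_size=5)
accuracy = sum((predict_proba(model, x) > 0.5) == bool(t)
               for x, t in zip(pool, truth)) / len(pool)
print(f"labeled {len(labels)} of {len(pool)} examples, pool accuracy {accuracy:.2f}")
```

In practice `ask_label` would route an example to a human annotator. To reproduce the comparison the abstract describes, the same budget could be spent on randomly chosen indices and the two resulting models scored on a held-out set.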

Many state-of-the-art results in natural language processing rely on the ability to use complex models with large datasets to learn rich representations useful for multiple tasks. Robert, Mario, and Ali’s examples use transfer learning from a pretrained language model to generate features that can be effectively used by low-complexity classifier models capable of training on relatively small datasets.
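
The pattern in this paragraph, frozen pretrained representations feeding a small trainable classifier, might look like the following sketch. Everything in it is illustrative: the `PRETRAINED` table is a hand-written stand-in for vectors that a real language model would supply, and the nearest-centroid classifier is just one example of a low-complexity model, not necessarily the one used in the session.

```python
# Hypothetical stand-in for a pretrained language model: a frozen lookup table
# of word vectors. A real pipeline would load representations learned on a
# large corpus instead of these hand-written values.
PRETRAINED = {
    "great": [0.9, 0.1], "love": [0.8, 0.2], "good": [0.7, 0.1],
    "awful": [-0.9, 0.1], "hate": [-0.8, 0.2], "bad": [-0.7, 0.1],
    "movie": [0.0, 0.9], "film": [0.0, 0.8],
}
DIM = 2


def featurize(text):
    """Mean-pool the frozen vectors of known words; unknown words are skipped."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def fit_centroids(texts, labels):
    """Low-complexity classifier: one mean feature vector per class."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(featurize(text))
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in by_class.items()}


def predict(centroids, text):
    """Assign the class whose centroid is nearest in feature space."""
    feats = featurize(text)
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(feats, centroids[c])))


# Only four labeled examples; the embedding table itself is never updated.
train_texts = ["great movie", "awful film", "love film", "hate movie"]
train_labels = ["pos", "neg", "pos", "neg"]
centroids = fit_centroids(train_texts, train_labels)

print(predict(centroids, "good movie"))
print(predict(centroids, "bad film"))
```

The words "good" and "bad" never appear in the four training texts. The classifier can still place them because the frozen vectors already encode their similarity to words it has seen, which is the benefit the paragraph attributes to transfer learning.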

As you integrate machine learning and AI into business processes, even small improvements in predictive performance can translate into huge ROI, so hyperparameter tuning is now an inherent part of many ML pipelines. Robert, Mario, and Ali explain how to leverage Spark clusters in platforms such as Azure Databricks to perform hyperparameter tuning, and detail the improvements this tuning produces in your classifier.
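
Hyperparameter trials are independent of one another, which is what makes them easy to spread across a cluster. The sketch below shows only that structure, using a local thread pool from the Python standard library; it does not use Spark or Azure Databricks APIs. The k-nearest-neighbor model, the grid values, and the synthetic data are assumptions chosen to keep the example self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

rng = random.Random(0)


def make_data(n):
    """Synthetic noisy two-feature binary data (a stand-in, not the talk's data)."""
    rows = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        rows.append((x, 1 if x[0] + x[1] + rng.gauss(0, 0.3) > 0 else 0))
    return rows


train_rows, valid_rows = make_data(150), make_data(100)


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_predict(x, k, weighted):
    """Classify x by a vote among its k nearest training rows."""
    nearest = sorted(train_rows, key=lambda row: sq_dist(row[0], x))[:k]
    votes = {0: 0.0, 1: 0.0}
    for features, label in nearest:
        votes[label] += 1.0 / (sq_dist(features, x) + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


def evaluate(params):
    """Score one hyperparameter combination on the held-out validation rows."""
    k, weighted = params
    hits = sum(knn_predict(x, k, weighted) == y for x, y in valid_rows)
    return {"k": k, "weighted": weighted, "accuracy": hits / len(valid_rows)}


# Every combination of neighborhood size and vote weighting is one trial.
grid = list(product([1, 3, 5, 9, 15], [False, True]))

# Trials are independent, so they can run concurrently; the thread pool is only
# a local stand-in for the workers of a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best = max(results, key=lambda r: r["accuracy"])
print(best)
```

On a cluster, the `pool.map` call is the piece that would be replaced by whatever mechanism dispatches trials to workers, while `evaluate` stays the same.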

Robert Horton


Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Mario Inchiosa


Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Ali Zaidi


Ali Zaidi is a PhD student in statistics at UC Berkeley. Previously, he was a data scientist in Microsoft’s AI and Research Group, where he worked to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Before that, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.