Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.
Though plentiful data is available in many domains, often the limiting factor in applying supervised machine learning techniques is the availability of useful labels. Labels often represent the interpretation that a human applies to an example, and obtaining such labels can be expensive, particularly in application domains where experts are highly compensated or difficult to find. Active learning is a model-driven selection process that helps to make more effective use of a labeling budget.
Robert, Mario, and Ali start by building a model on a small dataset, then use that model to select additional examples to label. Using multiple rounds of modeling and selection, you can obtain training sets that lead to much better-performing models than would be expected from training on a randomly selected dataset of similar size.
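The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the presenters' implementation: the one-dimensional logistic regression, the uncertainty-based selection rule (pick the examples whose predicted probability is closest to 0.5), and the names `train_logistic`, `active_learning`, and `oracle` are all assumptions made for the example. The `oracle` function stands in for a human labeler.

```python
import math
import random


def predict_proba(model, x):
    """Probability of the positive class under a 1-D logistic model."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train_logistic(labeled, epochs=500, lr=0.5):
    """Fit (weight, bias) by batch gradient descent on (x, y) pairs."""
    w, b = 0.0, 0.0
    n = len(labeled)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in labeled:
            err = predict_proba((w, b), x) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def active_learning(pool, oracle, seed_size=4, batch_size=2, rounds=5, seed=0):
    """Alternate between modeling and model-driven selection of new labels."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Round 0: label a small random seed set and fit the first model.
    labeled = [(x, oracle(x)) for x in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]
    model = train_logistic(labeled)

    for _ in range(rounds):
        # Select the examples the current model is least sure about.
        unlabeled.sort(key=lambda x: abs(predict_proba(model, x) - 0.5))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Spend labeling budget only on the selected batch, then refit.
        labeled.extend((x, oracle(x)) for x in batch)
        model = train_logistic(labeled)
    return model, labeled


def oracle(x):
    """Hypothetical labeler: the true class boundary sits at x = 0.2."""
    return 1 if x > 0.2 else 0


pool = [i / 50 - 1 for i in range(101)]  # 101 unlabeled points in [-1, 1]
model, labeled = active_learning(pool, oracle)
print(len(labeled), "labels used out of", len(pool))
```

With the defaults, the sketch labels 4 seed examples plus 2 per round for 5 rounds, so the labeling budget is fixed in advance no matter how large the pool is.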
Many state-of-the-art results in natural language processing rely on the ability to use complex models with large datasets to learn rich representations useful for multiple tasks. Robert, Mario, and Ali’s examples use transfer learning from a pretrained language model to generate features that can be effectively used by low-complexity classifier models capable of training on relatively small datasets.
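As a rough sketch of the idea, the example below treats a frozen lookup table as the "pretrained model" and trains a very simple classifier on the features it produces. Everything here is an assumption made for illustration: the `PRETRAINED` table is a tiny hand-written stand-in for a real pretrained language model, and the nearest-centroid classifier is just one possible low-complexity model, not necessarily the one used in the session.

```python
# Hypothetical toy "pretrained" word vectors. In practice these features
# would come from a large language model trained on a separate corpus.
PRETRAINED = {
    "great": (0.9, 0.1), "good": (0.8, 0.2), "love": (0.9, 0.0),
    "bad": (0.1, 0.9), "awful": (0.0, 0.9), "hate": (0.1, 0.8),
}


def featurize(text):
    """Average the frozen vectors of the known words in the text."""
    vecs = [PRETRAINED[t] for t in text.lower().split() if t in PRETRAINED]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def fit_centroids(texts, labels):
    """Low-complexity classifier: one mean feature vector per class."""
    grouped = {}
    for text, label in zip(texts, labels):
        grouped.setdefault(label, []).append(featurize(text))
    return {
        label: tuple(sum(dim) / len(feats) for dim in zip(*feats))
        for label, feats in grouped.items()
    }


def predict(centroids, text):
    """Assign the class whose centroid is closest in feature space."""
    feats = featurize(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(feats, centroids[label])),
    )


# Only four labeled examples are needed to fit the classifier.
train_texts = ["great movie", "good fun", "awful plot", "bad acting"]
train_labels = ["pos", "pos", "neg", "neg"]
centroids = fit_centroids(train_texts, train_labels)

# "love" and "hate" never appear in the training texts; the classifier
# handles them only because the pretrained features place them near
# words it did see.
print(predict(centroids, "I love it"))
print(predict(centroids, "I hate it"))
```

The point of the design is that the expensive representation is learned once, elsewhere, while the part trained on the small labeled set has very few parameters.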
As you integrate machine learning and AI into business processes, even small improvements in predictive performance can translate into huge ROI, so hyperparameter tuning is now an inherent part of many ML pipelines. Robert, Mario, and Ali explain how to leverage Spark clusters in platforms such as Azure Databricks to perform hyperparameter tuning, and detail the improvements this tuning produces in your classifier.
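The shape of such a search can be sketched locally. The example below is not Spark or Azure Databricks code: a thread pool from the Python standard library stands in for cluster workers, `validation_score` is a hypothetical placeholder for "train a classifier with these settings and return its validation score", and the parameter names and grid values are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_score(params):
    """Hypothetical stand-in for training a model and scoring it.

    A real pipeline would fit a classifier with `params` and return a
    held-out metric; this synthetic function simply peaks at
    learning_rate=0.1, regularization=0.01.
    """
    lr = params["learning_rate"]
    reg = params["regularization"]
    return 1.0 - (lr - 0.1) ** 2 - (reg - 0.01) ** 2


def grid_search(score_fn, grid, max_workers=4):
    """Score every combination in the grid, evaluating trials in parallel."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    # Each trial is independent, which is what makes the search easy to
    # distribute; here local threads play the role of cluster workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "regularization": [0.0, 0.01, 0.1],
}
best_params, best_score = grid_search(validation_score, grid)
print(best_params, best_score)
```

Because the trials do not depend on one another, the same structure carries over to a cluster: the list of candidate settings is distributed and only the scores are collected.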
Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.
Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.
Ali Zaidi is a PhD student in statistics at UC Berkeley. Previously, he was a data scientist in Microsoft’s AI and Research Group, where he worked to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Before that, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com