Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Machine-learned model quality monitoring in fast data and streaming applications

Emre Velipasaoglu (Lightbend)
14:55–15:35 Wednesday, 23 May 2018
Secondary topics: Managing and Deploying Machine Learning
Average rating: 3.67 (3 ratings)

Who is this presentation for?

  • Data scientists, machine learning engineers and developers, engineering leaders, and architects

Prerequisite knowledge

  • Familiarity with the main problems of machine learning (e.g., classification, regression, and clustering) and statistical testing

What you'll learn

  • Explore available machine-learned model quality monitoring methods

Description

Most machine learning algorithms are designed to work with stationary data. They are usually the first ones tried by teams building machine learning applications because they are readily available in popular open source libraries such as Python's scikit-learn and in distributed machine learning libraries like Spark MLlib. But real-life streaming data is rarely stationary: its statistical characteristics change over time, and so do the quality and relevance of models that depend on it. Machine-learned models built on data observed within a fixed time period therefore usually suffer a loss of prediction quality, a phenomenon known as concept drift.
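
To make the effect concrete, here is a minimal sketch (all data, class shapes, and numbers are synthetic and invented purely for illustration) of a scikit-learn classifier trained once on a fixed window and then scored on batches whose distribution gradually drifts:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_batch(n, shift):
        # Two Gaussian classes; `shift` moves both means over time,
        # simulating concept drift in the stream.
        X = np.vstack([rng.normal(-1 + shift, 1, (n, 2)),
                       rng.normal(+1 + shift, 1, (n, 2))])
        y = np.array([0] * n + [1] * n)
        return X, y

    # Train once on data observed within a fixed time period.
    X_train, y_train = make_batch(500, shift=0.0)
    model = LogisticRegression().fit(X_train, y_train)

    # Prediction quality decays as the stream drifts away from the
    # distribution the model was trained on.
    for shift in [0.0, 0.5, 1.0, 1.5, 2.0]:
        X_t, y_t = make_batch(500, shift)
        print(f"shift={shift:.1f}  accuracy={model.score(X_t, y_t):.2f}")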

There are several ways to deal with concept drift. The most common is to periodically retrain the models on new data, perhaps down-weighting the old data or discarding it entirely. The length of the retraining period is usually chosen based on the cost of retraining alone: changes in the input data and in prediction quality are not monitored, and the cost of inaccurate predictions is not included in the calculation.
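
As a sketch of what such a policy might look like (the batching scheme and decay rate are assumptions for illustration, not anything prescribed by the talk), scikit-learn's sample_weight argument can be used to exponentially down-weight older batches at each scheduled retrain:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def periodic_retrain(batches, decay=0.5):
        # `batches` is a list of (X, y) arrays ordered oldest to newest.
        # The newest batch gets weight 1.0, the previous one `decay`,
        # the one before that `decay`**2, and so on; decay=0.0 would
        # discard the old data entirely.
        X = np.vstack([Xb for Xb, _ in batches])
        y = np.concatenate([yb for _, yb in batches])
        w = np.concatenate([np.full(len(yb), decay ** (len(batches) - 1 - i))
                            for i, (_, yb) in enumerate(batches)])
        return LogisticRegression().fit(X, y, sample_weight=w)

Note that nothing in this sketch looks at the data itself: the model is refit on a fixed schedule whether or not anything has actually changed, which is exactly the blind spot described above.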

At the other end of the complexity spectrum are adaptive learning methods, but these algorithms still require parameter tuning to perform well. An attractive middle ground is to monitor machine-learned model quality by testing the inputs and predictions for changes over time and to use the detected change points in retraining decisions. This area has seen significant development over the last two decades. While most of these methods are suited to classification models, some newer methods handle regression problems as well.
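
One simple instance of this idea (a sketch only, not one of the specific methods surveyed in the talk) is to compare the distribution of recent prediction scores against a reference window captured right after training, using a two-sample Kolmogorov-Smirnov test, and to treat a significant difference as a change point:

    from collections import deque
    import numpy as np
    from scipy.stats import ks_2samp

    class ScoreDriftMonitor:
        """Flags a change point when recent prediction scores diverge
        from a reference window captured at (re)training time."""

        def __init__(self, window=500, alpha=0.01):
            self.window = window
            self.alpha = alpha
            self.reference = None
            self.recent = deque(maxlen=window)

        def set_reference(self, scores):
            self.reference = np.asarray(scores)
            self.recent.clear()

        def observe(self, score):
            # Feed one prediction score; returns True when drift is detected.
            self.recent.append(score)
            if self.reference is None or len(self.recent) < self.window:
                return False
            _, p_value = ks_2samp(self.reference, np.asarray(self.recent))
            return p_value < self.alpha

A detection would then trigger retraining (for example, with the down-weighted refit sketched above) rather than waiting for the next scheduled period. The same windowed test can also be applied to the input features themselves, which is useful when true labels arrive late or not at all.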

Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.

Emre Velipasaoglu

Lightbend

Emre Velipasaoglu is principal data scientist at Lightbend. A machine learning expert, Emre previously served as principal scientist and senior manager at Yahoo! Labs. He has authored 23 peer-reviewed publications and nine patents in search, machine learning, and data mining. Emre holds a PhD in electrical and computer engineering from Purdue University and completed postdoctoral training at Baylor College of Medicine.