Presented By O’Reilly and Cloudera

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

3:30pm–4:10pm Thursday, 09/13/2018

Data engineering and architecture
Location: 1E 07/08
Level: Intermediate

Secondary topics: Temporal data and time-series analytics

Who is this presentation for?

Data scientists and machine learning engineers

Prerequisite knowledge

A working knowledge of Structured Streaming in Apache Spark
An intermediate understanding of machine learning

What you'll learn

Understand how StreamDM can be used alongside Structured Streaming and the relevance of addressing concept drift when learning from streaming data
Learn how to implement active and reactive strategies to address this problem using StreamDM

Description

Adapting StreamDM to the novel Structured Streaming engine simplifies both its use and its development. The open source StreamDM library currently provides the largest collection of data stream mining algorithms for Spark, including both supervised and unsupervised learning algorithms that can be updated online. The main difference between the batch machine learning implementations in Spark (MLlib and Spark ML) and StreamDM is that the latter focuses on algorithms that can be trained and adapted incrementally. This can be a major advantage in some domains, as it enables learning models to be updated automatically. StreamDM is currently under development by Huawei Noah’s Ark Lab and Télécom ParisTech.
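To make the contrast with batch training concrete, here is a minimal sketch of incremental learning in plain Python. It is not the StreamDM API: the class OnlinePerceptron, the function prequential_accuracy, and the toy stream are all hypothetical names chosen for illustration. The model is updated one instance at a time and never stores past data, and it is evaluated in a test-then-train fashion, where each instance is used for prediction before it is used for learning.

```python
# Hypothetical illustration of incremental learning; not the StreamDM API.
class OnlinePerceptron:
    """Binary linear classifier updated one instance at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        # Update only on mistakes; no past instances are stored.
        error = y - self.predict(x)
        if error != 0:
            step = self.learning_rate * error
            self.weights = [w + step * v for w, v in zip(self.weights, x)]
            self.bias += step


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn_one(x, y)
        total += 1
    return correct / total if total else 0.0


# Toy stream (an assumption for this sketch): the label is 1 when the
# first feature exceeds the second.
stream = [((float(i % 7), float(i % 5)), int(i % 7 > i % 5)) for i in range(500)]
model = OnlinePerceptron(n_features=2)
accuracy = prequential_accuracy(model, stream)
print(f"prequential accuracy: {accuracy:.3f}")
```

Because every update touches only the current instance, the same loop could in principle run indefinitely over an unbounded stream, which is the property the description attributes to incrementally trained algorithms.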

There is a vast literature on addressing concept drift and learning from streaming data. Still, these methods can be complex to implement and integrate. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift). Adapting StreamDM for Structured Streaming is a natural step that facilitates future integration with major technology improvements, such as continuous processing. Heitor and Albert also introduce a simple yet powerful methodology to address concept drift using active strategies, such as combining ensemble models with drift detectors, and reactive strategies, such as forgetting mechanisms and periodic resets (windowed approaches).
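The two families of strategies named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the speakers' methodology and not the StreamDM API: MajorityClass, WindowedMajorityClass, ErrorRateDetector, and run_with_detector are hypothetical names, the learner is a trivial majority-class predictor, and the detector uses a fixed error-rate threshold as a crude stand-in for a statistical drift detector.

```python
from collections import Counter, deque


class MajorityClass:
    """Incremental learner predicting the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, y):
        self.counts[y] += 1


class WindowedMajorityClass:
    """Reactive strategy: forget everything older than the last `size` labels."""

    def __init__(self, size):
        self.window = deque(maxlen=size)

    def predict(self):
        return Counter(self.window).most_common(1)[0][0] if self.window else None

    def learn_one(self, y):
        self.window.append(y)  # the oldest label is dropped automatically


class ErrorRateDetector:
    """Active strategy: signal drift when the recent error rate is too high."""

    def __init__(self, size=20, threshold=0.5):
        self.errors = deque(maxlen=size)
        self.threshold = threshold

    def update(self, was_error):
        self.errors.append(int(was_error))
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold

    def reset(self):
        self.errors.clear()


def run_with_detector(labels):
    """Test-then-train loop that resets the model when drift is signaled."""
    model, detector, drifts = MajorityClass(), ErrorRateDetector(), []
    for i, y in enumerate(labels):
        if detector.update(model.predict() != y):
            model = MajorityClass()  # discard the outdated model
            detector.reset()
            drifts.append(i)
        model.learn_one(y)
    return model, drifts


# Toy stream with one abrupt concept drift after 100 instances.
labels = ["a"] * 100 + ["b"] * 60

static = MajorityClass()
windowed = WindowedMajorityClass(size=20)
for y in labels:
    static.learn_one(y)
    windowed.learn_one(y)
adaptive, drifts = run_with_detector(labels)

print(static.predict(), windowed.predict(), adaptive.predict(), drifts)
```

On this toy stream the model without any drift handling keeps predicting the old concept, while both the windowed learner and the detector-driven reset follow the new one. The trade-off is visible in the parameters: the window size and the detector threshold control how quickly old data is discarded, and setting them too aggressively would discard useful history on a stationary stream.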

Heitor Murilo Gomes

Télécom ParisTech

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning, particularly evolving data streams, concept drift, ensemble methods, and big data streams. He co-leads the StreamDM open data stream mining project.

Albert Bifet

Télécom ParisTech

Albert Bifet is a professor and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international experience in research, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache scalable advanced massive online analysis (SAMOA), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led massive online analysis (MOA), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations. He was cochair of the industrial track at ECML PKDD, BigMine, and the data streams track at ACM SAC. He holds a PhD from BarcelonaTech.

