Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Continuous machine learning over streaming data: The story continues.

Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services)
2:05pm–2:45pm Wednesday, 09/12/2018
Data science and machine learning
Location: 1A 12/14 Level: Intermediate
Secondary topics: Retail and e-commerce, Temporal data and time-series analytics
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Machine learning scientists, data scientists, and developers

Prerequisite knowledge

  • A general understanding of streaming data and unsupervised learning

What you'll learn

  • Explore the robust random cut forest (RRCF) algorithm, which efficiently maintains a sketch of a data stream and continuously learns (adapts) as new data arrives, along with new applications of practical importance for processing real-time streaming data


Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.

This talk extends their Strata San Jose 2018 session, where they first presented the RRCF algorithm, which maintains an efficient sketch of a data stream and continuously adapts (learns) each time it sees a new data record. Roger, Sudipto, and Kapil discuss new applications and results, including implementation details. After briefly introducing the RRCF algorithm, they present its application to imputing missing values in a data stream. They then detail its application to forecasting future values when the stream is a time series, and describe how the RRCF algorithm can be used to detect emerging hotspots in a data stream and perform multiclass classification over streaming data.

For each application of the RRCF algorithm, Roger, Sudipto, and Kapil present an actual customer use case along with the results of experiments that compare the RRCF approach with best-in-class methods. They conclude with a deep dive into the efficient implementation of the RRCF algorithm that enables it to operate and continuously learn in real time over massive data streams.
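
To make the idea of a stream sketch that adapts with every new record more concrete, here is a minimal, illustrative sketch in Python. It is not the presenters' implementation and not the RRCF algorithm itself. It assumes a plain sliding window as the sketch, re-draws random cuts for every record instead of maintaining trees incrementally, and uses mean isolation depth as a stand-in for the RRCF score. The names `isolation_depth` and `score_stream`, and all parameter defaults, are hypothetical.

```python
import random
from collections import deque


def isolation_depth(points, target, rng):
    """Count the random cuts needed to separate `target` from `points`.

    `points` must contain `target`. Fewer cuts suggest a more unusual record.
    """
    depth = 0
    while len(points) > 1:
        dims = range(len(target))
        lows = [min(p[d] for p in points) for d in dims]
        highs = [max(p[d] for p in points) for d in dims]
        spans = [hi - lo for lo, hi in zip(lows, highs)]
        if sum(spans) == 0:
            break  # only duplicates of the target remain
        # Pick the cut dimension with probability proportional to its span.
        d = rng.choices(list(dims), weights=spans)[0]
        cut = rng.uniform(lows[d], highs[d])
        # Keep only the side of the cut that contains the target.
        if target[d] <= cut:
            kept = [p for p in points if p[d] <= cut]
        else:
            kept = [p for p in points if p[d] > cut]
        if len(kept) < len(points):
            depth += 1  # the cut actually removed some points
        points = kept
    return depth


def score_stream(stream, window=64, trees=20, warmup=16, seed=0):
    """Yield (point, mean isolation depth) once `warmup` points were seen."""
    rng = random.Random(seed)
    recent = deque(maxlen=window)  # assumed sketch: a bounded window of records
    for point in stream:
        if len(recent) >= warmup:
            candidates = list(recent) + [point]
            depths = [isolation_depth(candidates, point, rng)
                      for _ in range(trees)]
            yield point, sum(depths) / trees
        recent.append(point)  # adapt: the newest record displaces the oldest
```

Feeding `score_stream` a list of 2-D tuples drawn from a tight cluster, with one far-away tuple mixed in, should give that tuple a noticeably smaller mean depth than its neighbors. A production system would update its trees in place rather than rebuilding cuts for every record.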

Roger Barga

Amazon Web Services

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Sudipto Guha

Amazon Web Services

Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.

Kapil Chhabra

Amazon Web Services

Kapil Chhabra is a senior product manager at Amazon Web Services, focusing on real-time machine learning on high-volume and high-velocity data. He also runs the streaming data ingestion business at AWS, Kinesis Data Firehose. Previously, he led the analytics business at Akamai Technologies and launched and scaled multiple new products, including real-time video monitoring services (Media Analytics and QoS Monitor) and the award-winning broadcast operations as a service (BOCC).

Comments on this page are now closed.


Shashank Shashikant Rao | DATA SCIENTIST
09/14/2018 1:51pm EDT

Can you please upload the slides again? I don't see them.

Roger Barga
09/14/2018 7:18am EDT

@Mahmood, the slides have been uploaded – best, roger

Mahmood Qadir | MR
09/12/2018 10:54am EDT

Link to the slides please