Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.
This session extends their talk at Strata San Jose 2018, where they first presented the RRCF algorithm, which maintains an efficient sketch of a data stream and continuously adapts (learns) each time it sees a new data record. Roger, Sudipto, and Kapil discuss new applications and results, including implementation details. After briefly introducing the RRCF algorithm, they show how it can impute missing values in a data stream. They then detail how it forecasts future values when the stream is a time series, and describe how it can detect emerging hotspots in a data stream and perform multiclass classification over streaming data.
For each application of the RRCF, Roger, Sudipto, and Kapil present an actual customer use case along with the results of experiments that compare the RRCF approach with best-in-class methods. They conclude with a deep dive into the efficient implementation of the RRCF algorithm that enables it to operate and continuously learn in real time over massive data streams.
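To make the idea of a continuously updated sketch of a stream more concrete, here is a minimal Python sketch. It is not the speakers' implementation and makes several simplifying assumptions: the sketch is a fixed-size reservoir sample, the trees are rebuilt from that sample on every query instead of being updated incrementally, and a point is scored by its average isolation depth rather than by the RRCF scoring rule. The names `build_tree`, `depth_of`, `StreamingCutForest`, and `demo` are hypothetical and chosen only for illustration.

```python
import random


def build_tree(points, rng):
    """Build a random cut tree; None marks a leaf (one point or duplicates)."""
    if len(points) <= 1 or all(p == points[0] for p in points):
        return None
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    # Pick the cut dimension with probability proportional to its range,
    # then pick a cut position uniformly inside that range.
    d = rng.choices(dims, weights=[h - l for l, h in zip(lo, hi)])[0]
    cut = rng.uniform(lo[d], hi[d])
    left = [p for p in points if p[d] <= cut]
    right = [p for p in points if p[d] > cut]
    if not right:  # rounding edge case: the cut landed exactly on the maximum
        return build_tree(points, rng)
    return (d, cut, build_tree(left, rng), build_tree(right, rng))


def depth_of(tree, point):
    """Count how many cuts a point passes before it reaches a leaf."""
    depth = 0
    while tree is not None:
        d, cut, left, right = tree
        tree = left if point[d] <= cut else right
        depth += 1
    return depth


class StreamingCutForest:
    """Illustrative only: a reservoir sample plus trees rebuilt per query."""

    def __init__(self, num_trees=30, sample_size=128, seed=0):
        self.rng = random.Random(seed)
        self.num_trees = num_trees
        self.sample_size = sample_size
        self.sample = []
        self.seen = 0

    def update(self, point):
        """Reservoir sampling (Algorithm R): keep a uniform sample of the stream."""
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(point)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = point

    def mean_depth(self, point):
        """Average isolation depth over fresh trees; lower means more unusual."""
        trees = [build_tree(self.sample, self.rng) for _ in range(self.num_trees)]
        return sum(depth_of(t, point) for t in trees) / self.num_trees


def demo():
    """Feed synthetic 2-D Gaussian records, then score two query points."""
    data_rng = random.Random(42)
    forest = StreamingCutForest()
    for _ in range(1000):
        forest.update((data_rng.gauss(0, 1), data_rng.gauss(0, 1)))
    typical = forest.mean_depth((0.0, 0.0))    # near the bulk of the data
    unusual = forest.mean_depth((10.0, 10.0))  # far from everything seen
    return forest, typical, unusual
```

Calling `demo()` on this synthetic data should give the far-away point a smaller mean depth than the central one, because random cuts tend to separate outlying points from the rest early. A production implementation would update the trees in place as records arrive and leave the sample, which is what makes real-time operation over massive streams feasible.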
Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.
Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.
Kapil Chhabra is a senior product manager at Amazon Web Services, focusing on real-time machine learning on high-volume and high-velocity data. He also runs the streaming data ingestion business at AWS, Kinesis Data Firehose. Previously, he led the analytics business at Akamai Technologies and launched and scaled multiple new products, including real-time video monitoring services (Media Analytics and QoS Monitor) and the award-winning broadcast operations as a service (BOCC).
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
Comments
Can you please upload the slides again? I don't see them.
@Mahmood, the slides have been uploaded – best, roger
Link to the slides please