Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

MacroBase: A Search Engine for Fast Data Streams

Sahaana Suri (Stanford University)
4:35pm–5:15pm Thursday, September 28, 2017
Stream processing and analytics
Location: 1E 07/08
Level: Intermediate
Secondary topics: Streaming

Who is this presentation for?

Data engineers, DevOps, and data scientists (ML expertise not needed)

Prerequisite knowledge

Basic understanding of data analytics pipelines; bonus: production experience with time-series data, observability, or data science

What you'll learn

It’s possible to do much better than simple rule-based classification and offline, ad-hoc root cause analysis by combining ML-powered classification and explanation operators; MacroBase provides a reference architecture.

Description

MacroBase is a new open source analytics engine from the Stanford InfoLab designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. In many deployments at scale, an overwhelming proportion of the data collected is never read and is instead retained only for reactive failure analysis. In response, MacroBase analyzes data as it arrives and provides high-level, interpretable explanations of stream behaviors. This makes the collected data more useful and enables real-time root cause analysis and anomaly detection.

At its core, MacroBase combines streaming classification and explanation operators to both identify individual points of interest and highlight commonalities across them. For example, the Android device ecosystem comprises over 24,000 distinct device types; is a mobile application behaving correctly on all of them? MacroBase’s classification operators can identify abnormally behaving devices, while its explanation operators can aggregate many such devices, producing more interpretable outputs. MacroBase is therefore designed as both a set of reconfigurable dataflow operators and a series of end-to-end dataflow pipelines, which have already been used to diagnose issues in production streams in mobile, data center, and industrial applications.
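
To make the classify-then-explain idea concrete, here is a minimal batch sketch in Python. It is not MacroBase's API and not its streaming implementation: the function names, the `device` and `latency_ms` fields, the sample readings, and both thresholds are hypothetical. The classifier flags points with a large robust z-score (distance from the median, scaled by the median absolute deviation), and the explainer ranks attribute values by risk ratio, that is, how much more likely a record is to be flagged when it carries that value.

```python
from collections import Counter
from statistics import median


def classify_outliers(records, metric, threshold=3.0):
    """Flag records whose metric is far from the median (robust z-score)."""
    values = [r[metric] for r in records]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: most values are identical, so flag anything that differs.
        return [v != med for v in values]
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return [abs(v - med) / (1.4826 * mad) > threshold for v in values]


def explain(records, flags, attribute, min_risk_ratio=3.0):
    """Return (value, risk_ratio) pairs for values over-represented among outliers."""
    out_counts = Counter(r[attribute] for r, f in zip(records, flags) if f)
    in_counts = Counter(r[attribute] for r, f in zip(records, flags) if not f)
    total_out = sum(out_counts.values())
    total_in = sum(in_counts.values())
    explanations = []
    for value, with_out in out_counts.items():
        with_in = in_counts[value]
        without_out = total_out - with_out
        without_in = total_in - with_in
        if without_out + without_in == 0:
            continue  # every record has this value, so there is nothing to compare
        rate_with = with_out / (with_out + with_in)
        rate_without = without_out / (without_out + without_in)
        ratio = float("inf") if rate_without == 0 else rate_with / rate_without
        if ratio >= min_risk_ratio:
            explanations.append((value, ratio))
    return sorted(explanations, key=lambda pair: pair[1], reverse=True)


# Hypothetical latency readings from three device types.
readings = {
    "A": [100, 102, 98, 101, 99, 103, 97, 480],
    "B": [101, 99, 100, 102, 98, 100],
    "C": [500, 520, 510, 101],
}
records = [
    {"device": device, "latency_ms": latency}
    for device, latencies in readings.items()
    for latency in latencies
]

flags = classify_outliers(records, "latency_ms")
explanations = explain(records, flags, "device")
print(explanations)
```

On this toy data, four readings are flagged, and only device C is reported: three of its four readings are outliers, giving it a risk ratio of about 10.5. Device A also has one flagged reading, but its risk ratio is below 1, so it is not surfaced. The output is a short summary ("device C") rather than a list of individual anomalous points, which is the kind of aggregation the explanation operators are meant to provide.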

We will walk through the core concepts behind MacroBase, its architecture, key use cases, and takeaways from the recent research literature for data scientists, data engineers, and DevOps.


Sahaana Suri

Stanford University

Sahaana Suri is a second-year PhD student in the Stanford InfoLab, working with Peter Bailis. Sahaana’s research focuses on building easy-to-use, accessible data analytics and machine learning systems that scale. She holds a bachelor’s degree in Electrical Engineering and Computer Science from the University of California, Berkeley.
