Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Bullet: Querying streaming data in transit with sketches

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)
2:40pm-3:20pm Thursday, March 28, 2019
Average rating: 3.67 out of 5 (3 ratings)

Who is this presentation for?

  • Engineers, architects, product owners, managers, directors, vice presidents, and senior vice presidents

Level

Beginner

Prerequisite knowledge

  • Familiarity with streaming technologies, such as Apache Spark Streaming and Kafka (useful but not required)

What you'll learn

  • Explore Bullet
  • Understand the challenges of querying streaming data efficiently at scale without storing it, and of performing otherwise intractable aggregations and windowing under those constraints

Description

Bullet is a lightweight, scalable, open source, multitenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet runs arbitrary queries against the unbounded set of data that arrives after the query is submitted; in other words, Bullet queries look forward in time. These queries can filter, project, and aggregate data in transit. Bullet is also platform and framework agnostic. Almost every layer in Bullet sits behind a core abstraction, so implementations can be mixed and matched: Storm or Spark for the backend layer, Kafka or another message queue for the pub/sub layer, and so on.
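
To make "queries look forward in time" concrete, here is a minimal Python sketch. It is not Bullet's API: `ForwardQuery` and `Stream` are hypothetical names invented for illustration. The only point is that a query sees records emitted after it was submitted, and the stream itself keeps no history.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class ForwardQuery:
    """Hypothetical query: filter with `predicate`, project onto `fields`, count matches."""
    predicate: Callable[[Record], bool]
    fields: List[str]
    matched: int = 0
    results: List[Record] = field(default_factory=list)

    def consume(self, record: Record) -> None:
        if self.predicate(record):                      # filter
            self.matched += 1                           # aggregate (a simple count)
            projected = {name: record.get(name) for name in self.fields}
            self.results.append(projected)              # project


class Stream:
    """Hypothetical stream: forwards each record to live queries and stores nothing."""

    def __init__(self) -> None:
        self.queries: List[ForwardQuery] = []

    def submit(self, query: ForwardQuery) -> None:
        self.queries.append(query)

    def emit(self, record: Record) -> None:
        for query in self.queries:
            query.consume(record)


stream = Stream()
stream.emit({"user": "a", "country": "US"})   # arrives before the query: never seen

query = ForwardQuery(lambda r: r["country"] == "US", ["user"])
stream.submit(query)
stream.emit({"user": "b", "country": "US"})   # matches
stream.emit({"user": "c", "country": "DE"})   # filtered out

print(query.matched, query.results)
```

Because the stream holds no records, a query submitted late has no way to recover earlier data, which is the trade-off a storage-free design accepts.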

Akshai Sarma and Nathan Speidel share their motivation for creating Bullet, detail its innovative architecture, and explain how sketches fit in. They then demonstrate the latest changes to Bullet on a real, high-volume dataset in use in production and discuss how they dealt with the challenges of implementing intractable aggregations on arbitrary streaming data, such as counting distinct values, finding the top K items, and getting percentiles (such as the 99th percentile) of an unknown distribution.
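
As an illustration of why sketches make distinct counts tractable, the following is a minimal K-Minimum Values (KMV) sketch in Python. It is a textbook technique, not the DataSketches implementation and not Bullet's code; the class name `KMVSketch` and the default `k` are assumptions made for this example. It keeps only the `k` smallest hash values it has seen, so memory stays bounded however many records flow past, and two sketches can be merged without revisiting the data.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum Values sketch for approximate distinct counting."""

    def __init__(self, k: int = 1024) -> None:
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest values
        self._members = set()  # hashes currently retained, used to skip duplicates

    @staticmethod
    def _hash(item) -> float:
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64   # roughly uniform in [0, 1)

    def _offer(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # h is smaller than the largest retained hash: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item) -> None:
        self._offer(self._hash(item))

    def merge(self, other: "KMVSketch") -> None:
        """Fold another sketch into this one (assumes both use the same hash)."""
        for h in other._members:
            self._offer(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest  # standard KMV estimator


sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")   # duplicates do not change the estimate
print(round(sketch.estimate()))
```

With `k = 1024` the relative standard error is roughly 1/sqrt(k), or about 3%, while the sketch holds at most 1,024 hash values. Mergeability is the property that matters for a distributed backend: each worker can sketch its own partition and the partial sketches can be combined afterwards.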

Handling this challenge while also implementing various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining the results of these aggregations is a pretty hefty task. Throwing this challenge onto a system that operates with no persistence layer on arbitrary, very high-volume data streams in today’s IoT world seems like an impossible problem. Akshai and Nathan share how they solved all this using sketches in a simple and elegant manner, comparing different approaches to show why they settled on using DataSketches and outlining their trade-offs.
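
The windowing terms above can be sketched in a few lines of Python. This shows only the general idea of time-based tumbling and hopping windows; it does not describe how Bullet defines or implements its windows, and the function names and integer timestamps are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Window = Tuple[int, int]   # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """The single non-overlapping window of length `size` that contains `ts`."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Window]:
    """Every window of length `size`, starting at a non-negative multiple of `hop`,
    that contains `ts`. With hop < size the windows overlap."""
    windows = []
    start = ts - ts % hop
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


def count_per_window(timestamps: Iterable[int], size: int) -> Dict[Window, int]:
    """Count events per tumbling window."""
    return dict(Counter(tumbling_window(ts, size) for ts in timestamps))


print(tumbling_window(7, 5))          # (5, 10)
print(hopping_windows(7, 10, 5))      # [(0, 10), (5, 15)]
print(count_per_window([1, 2, 6, 7, 8, 11], 5))
```

A plain count is easy to keep per window. The harder case described above is keeping a distinct count, top K, or percentile per window, where each window would hold a bounded-size sketch rather than a single integer.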

Akshai Sarma

Yahoo

Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Yahoo. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.

Nathan Speidel

Yahoo

Nathan Speidel develops novel solutions to big data problems at Yahoo (Verizon Media Group) and works on the Audience Data ETL pipeline. He enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie, and Hive as well as new, cutting-edge open source tools like Bullet to push the limits of streaming data processing, visualization, querying, and transformation.