Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Bullet: Querying streaming data in transit with sketches

Akshai Sarma (Oath), Nathan Speidel (Yahoo)
2:40pm-3:20pm Thursday, March 28, 2019
Secondary topics:  Storage, Streaming, realtime analytics, and IoT

Who is this presentation for?

  • Engineers, architects, product owners, managers, directors, vice presidents, and senior vice presidents

Level

Beginner

Prerequisite knowledge

  • Familiarity with streaming technologies, such as Apache Spark Streaming and Kafka (useful but not required)

What you'll learn

  • Explore Bullet
  • Understand the challenges of querying streaming data efficiently at scale without storing it, and of performing otherwise intractable aggregations and windowing under those constraints

Description

Bullet is a lightweight, scalable, open source, multitenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet queries look forward in time: they run against an unbounded set of data that arrives after the query is submitted. These queries can filter, project, and aggregate data in transit. Bullet is also platform and framework agnostic. Almost all of its layers are built on core abstractions, so implementations can be mixed and matched: Storm or Spark for the backend layer, Kafka or another message queue for the pub/sub layer, and so on.
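To make the "look forward in time" idea concrete, here is a minimal Python sketch of a query that filters, projects, and counts only the records that arrive after it is submitted. This is an illustration under assumptions, not Bullet's API: `run_query`, `events`, and the record fields are hypothetical names invented for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def run_query(stream: Iterable[Record],
              predicate: Callable[[Record], bool],
              fields: List[str],
              max_records: int) -> Dict[str, object]:
    """Filter and project records seen after the query starts, then count them."""
    matched = []
    for record in stream:
        if predicate(record):
            # Projection: keep only the requested fields.
            matched.append({name: record.get(name) for name in fields})
            if len(matched) >= max_records:
                break  # the query ends once it has enough results
    return {"count": len(matched), "records": matched}


def events() -> Iterator[Record]:
    """Hypothetical in-memory stand-in for a data stream."""
    for i in range(100):
        yield {"id": i,
               "country": "US" if i % 2 == 0 else "CA",
               "latency_ms": i * 3}


stream = events()
next(stream)  # record 0 flows by before the query is submitted and is never seen
result = run_query(stream, lambda r: r["country"] == "US", ["id", "latency_ms"], 3)
print(result)
```

Nothing is stored beyond the results of the query itself: records that do not match, or that passed before submission, are simply gone.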

Akshai Sarma and Michael Natkovich share their motivation for creating Bullet, detail its innovative architecture, and explain how sketches fit in. They then demonstrate the latest changes to Bullet on a real high-volume dataset in use in production and discuss how they dealt with the challenges of implementing intractable aggregations on arbitrary streaming data, such as counting distinct values, finding the top K items, or getting percentiles (for example, the 99th percentile) of an unknown distribution.
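As a rough illustration of why sketches make an aggregation like count distinct tractable, the following Python example implements a K-minimum-values (KMV) estimator. It is a simplified stand-in chosen for brevity, not the DataSketches implementation and not Bullet's code; the class name `KMVSketch` and its methods are hypothetical. The sketch keeps only the k smallest hash values it has seen, so memory stays bounded no matter how many records flow by, and two sketches can be merged without revisiting the data.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that keeps the k smallest hash values."""

    def __init__(self, k=512):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # the same hashes, for duplicate checks

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform value in (0, 1].
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

    def _offer(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._offer(self._hash(item))

    def merge(self, other):
        # The k smallest hashes of a union lie within the two kept sets.
        for h in other._members:
            self._offer(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        return (self.k - 1) / (-self._heap[0])


# Two workers each see part of the stream; their ranges overlap.
a, b = KMVSketch(), KMVSketch()
for user_id in range(0, 15000):
    a.update(user_id)
for user_id in range(10000, 25000):
    b.update(user_id)
a.merge(b)
print(round(a.estimate()))  # approximates the 25,000 distinct ids
```

The relative error of this estimator shrinks roughly as one over the square root of k, which is the usual trade-off with sketches: a fixed, small memory budget in exchange for a bounded approximation error.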

Handling this challenge while also implementing various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining the results of these aggregations is a pretty hefty task. Throwing this challenge onto a system that operates with no persistence layer on arbitrary, very high-volume data streams in today’s IoT world seems like an impossible problem. Akshai and Michael share how they solved all this using sketches in a simple and elegant manner, comparing different approaches and their trade-offs to show why they settled on DataSketches.
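For readers new to windowing, here is a minimal Python sketch of the simplest mechanism named above, a tumbling window: fixed-size, non-overlapping time buckets, each emitting its own aggregate when it closes. It assumes events arrive in timestamp order and uses an exact set for the per-window distinct count; in a system like the one described, a mergeable sketch would take the place of the set. The function name and event shape are hypothetical.

```python
def tumbling_distinct(events, window_size):
    """Yield (window_start, distinct_key_count) for each non-empty window.

    events: iterable of (timestamp, key) pairs, assumed to be in time order.
    """
    current_start = None
    keys = set()
    for ts, key in events:
        start = (ts // window_size) * window_size
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window has closed: emit it and start afresh.
            yield current_start, len(keys)
            current_start, keys = start, set()
        keys.add(key)
    if current_start is not None:
        yield current_start, len(keys)  # flush the final window


stream = [(0, "a"), (1, "b"), (4, "a"), (5, "c"), (9, "c"), (12, "a")]
windows = list(tumbling_distinct(stream, window_size=5))
print(windows)  # [(0, 2), (5, 1), (10, 1)]
```

Only the state of the currently open window is held in memory, which is what makes this workable without a persistence layer. Hopping and sliding windows overlap, so they need per-slice state that can be combined, which is where mergeable sketches help.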

Akshai Sarma

Oath

Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Oath. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.

Nathan Speidel

Yahoo

Nathan Speidel develops novel solutions to big data problems at Yahoo (Verizon Media Group). Working on the Audience Data ETL pipeline, he enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie, and Hive, as well as new, cutting-edge open source tools like Bullet, to push the limits of streaming data processing, visualization, querying, and transformation.
