Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Bullet: Querying streaming data in transit with sketches

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)

2:40pm–3:20pm Thursday, March 28, 2019

Data Engineering & Architecture
Location: 2006

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Average rating: 3.67 (3 ratings)

Who is this presentation for?

Engineers, architects, product owners, managers, directors, vice presidents, and senior vice presidents

Level

Beginner

Prerequisite knowledge

Familiarity with streaming technologies, such as Apache Spark Streaming and Kafka (useful but not required)

What you'll learn

Explore Bullet
Understand the challenges of querying streaming data efficiently at scale without storing it, and of performing otherwise intractable aggregations and windowing under those constraints

Description

Bullet is a lightweight, scalable, open source, multitenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet queries look forward in time: a query runs against the unbounded set of data that arrives after it is submitted. These queries can filter, project, and aggregate data in transit. Bullet is also platform- and framework-agnostic. Its layers are built on core abstractions, so almost all of them can be mixed and matched with different implementations, such as Storm or Spark for the backend layer, Kafka or another message queue for the pub/sub layer, and so on.
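To make the forward-looking model concrete, here is a minimal Python sketch of a query that is registered partway through a stream and only sees records that arrive afterward. It is an illustration under stated assumptions, not Bullet's API: `ForwardQuery`, `run_stream`, and the record fields are hypothetical names, and aggregation is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


@dataclass
class ForwardQuery:
    """Hypothetical query: keep matching records, projected to a few fields."""
    predicate: Callable[[Record], bool]   # filter
    fields: List[str]                     # projection
    max_records: int                      # the query ends after this many results
    results: List[Record] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.results) >= self.max_records

    def offer(self, record: Record) -> None:
        # Nothing is stored except the results the query itself asked for.
        if not self.done and self.predicate(record):
            self.results.append({name: record.get(name) for name in self.fields})


def run_stream(events: Iterable[Record], submissions: Dict[int, ForwardQuery]) -> None:
    """Feed each record to the queries that are active when it arrives.

    `submissions` maps a stream position to the query submitted just before
    the record at that position (an assumption made for this illustration).
    """
    active: List[ForwardQuery] = []
    for position, record in enumerate(events):
        if position in submissions:
            active.append(submissions[position])
        for query in active:
            query.offer(record)
        # Finished queries are dropped; records seen earlier are never replayed.
        active = [query for query in active if not query.done]
```

A query submitted at position 2 never sees the records at positions 0 and 1, which is the sense in which such queries look forward in time.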

Akshai Sarma and Nathan Speidel share their motivation for creating Bullet, detail its innovative architecture, and explain how sketches fit in. They then demonstrate the latest changes to Bullet on a real high-volume dataset in use in production. They also discuss how they dealt with the challenges of implementing intractable aggregations on arbitrary streaming data, such as counting distinct values, finding the top K items, and computing percentiles (for example, the 99th) of an unknown distribution.

Handling these aggregations while also implementing various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining their results is a pretty hefty task. Doing so on a system that has no persistence layer and operates on arbitrary, very high-volume data streams in today’s IoT world seems like an impossible problem. Akshai and Nathan share how they solved all this using sketches in a simple and elegant manner, comparing different approaches, outlining their trade-offs, and showing why they settled on DataSketches.
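As a rough illustration of why sketches suit this setting, the Python example below estimates distinct counts in bounded memory with a K-Minimum-Values (KMV) sketch and keeps one sketch per tumbling window. It is a textbook technique chosen for brevity, not the DataSketches library or Bullet's implementation; `KMVSketch`, `tumbling_distinct`, and the default `k` are assumptions made for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; memory stays O(k) for any stream."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is at index 0
        self._members = set()  # the hashes currently kept, for duplicate checks

    @staticmethod
    def _hash(item):
        # Map the item to a deterministic, roughly uniform value in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self._add_hash(self._hash(item))

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        # Assumes both sketches use the same k and the same hash function.
        for h in other._members:
            self._add_hash(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))      # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]   # standard KMV estimator


def tumbling_distinct(events, window_seconds, k=256):
    """events: iterable of (timestamp_seconds, item). Returns {window_start: sketch}."""
    windows = {}
    for timestamp, item in events:
        start = (timestamp // window_seconds) * window_seconds
        if start not in windows:
            windows[start] = KMVSketch(k)
        windows[start].add(item)
    return windows
```

Because `merge` yields the same state as feeding both inputs into a single sketch, partial results computed on different workers or in adjacent windows can be combined without revisiting the raw records, which matters when there is no persistence layer to replay from. The trade-off is accuracy: the estimate's relative error shrinks roughly as one over the square root of `k`.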

Akshai Sarma

Yahoo

Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Yahoo. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.

Nathan Speidel

Yahoo

Nathan Speidel develops novel solutions to big data problems at Yahoo (Verizon Media Group) and works on the Audience Data ETL pipeline. He enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie, and Hive as well as new, cutting-edge open source tools like Bullet to push the limits of streaming data processing, visualization, querying, and transformation.

