Bullet is a lightweight, scalable, open source, multitenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet runs arbitrary queries against an unbounded set of data that arrives after the query is submitted; in other words, Bullet queries look forward in time. These queries can filter, project, and aggregate data in transit. Bullet is also platform and framework agnostic. Almost every layer in Bullet is defined by a core abstraction and can be swapped for a different implementation: Storm or Spark for the backend layer, Kafka or another message queue for the pub/sub layer, and so on.
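To make the "look forward in time" idea concrete, here is a minimal Python sketch of a query that sees only records arriving after it is submitted, applies a filter, projects a few fields, and keeps nothing but a running aggregate. It is an illustration only and not Bullet's API: the function `run_forward_query`, its parameters, and the record fields (`country`, `page`, `ms`) are all hypothetical names chosen for this example.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Dict, Iterable, Sequence


def run_forward_query(
    stream: Iterable[dict],
    predicate: Callable[[dict], bool],
    fields: Sequence[str],
    group_by: str,
    max_records: int,
) -> Dict[object, int]:
    """Hypothetical forward-looking query: count matching records per group.

    Only records read from `stream` after this call starts are visited, and
    no raw record is retained; the only state is the per-group counter.
    """
    if group_by not in fields:
        raise ValueError("group_by must be one of the projected fields")

    counts: Counter = Counter()
    # The query ends after inspecting `max_records` records from the stream.
    for record in islice(stream, max_records):
        if not predicate(record):            # filter
            continue
        projected = {name: record.get(name) for name in fields}  # project
        counts[projected[group_by]] += 1     # aggregate
    return dict(counts)


# Usage: a finite list stands in for an unbounded stream of events.
events = iter([
    {"country": "US", "page": "home", "ms": 120},
    {"country": "DE", "page": "home", "ms": 80},
    {"country": "US", "page": "search", "ms": 300},
])
slow_by_country = run_forward_query(
    events, lambda r: r["ms"] > 100, ("country", "page"), "country", max_records=3
)
```

The query here terminates on a record count; a time-based duration would work the same way, with the loop ending on a deadline instead.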
Akshai Sarma and Nathan Speidel share their motivation for creating Bullet, detail its innovative architecture, and explain how sketches fit in. They then demonstrate the latest changes to Bullet on a real, high-volume dataset in use in production and discuss how they dealt with the challenges of implementing otherwise intractable aggregations on arbitrary streaming data, such as count distinct, finding the top K items, and getting percentiles (such as the 99th percentile) of an unknown distribution.
Handling this challenge while also implementing various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining the results of these aggregations is a hefty task. Doing so on a system that operates with no persistence layer, on arbitrary, very high-volume data streams in today’s IoT world, seems like an impossible problem. Akshai and Nathan share how they solved all this using sketches in a simple and elegant manner, comparing different approaches and their trade-offs to show why they settled on DataSketches.
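As a rough illustration of why sketches suit this setting, the following Python example implements a textbook K-Minimum-Values (KMV) estimator for count distinct. It is not the DataSketches implementation and not Bullet's code; the class name `KMVSketch`, the parameter `k`, and the choice of SHA-1 as the hash are assumptions made for this example. The sketch keeps at most `k` hash values regardless of stream size, and two sketches can be merged, which is the property that lets partial results from separate workers or separate windows be combined without storing raw data.

```python
import hashlib
import heapq


class KMVSketch:
    """Textbook K-Minimum-Values distinct-count estimator (illustrative only)."""

    def __init__(self, k: int = 512):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # same hashes, used to ignore duplicates

    @staticmethod
    def _hash(item) -> float:
        # Map the item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _offer(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item) -> None:
        self._offer(self._hash(item))

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        # The k smallest hashes of the union are among each side's k smallest.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))      # fewer than k distinct: exact
        return (self.k - 1) / -self._heap[0]   # (k - 1) / k-th smallest hash


# Usage: two overlapping partial streams, e.g. two workers or two windows.
left, right = KMVSketch(), KMVSketch()
for i in range(20000):
    left.update(f"user-{i}")
for i in range(10000, 30000):
    right.update(f"user-{i}")
combined_estimate = left.merge(right).estimate()  # approximates 30000 distinct
```

The accuracy of this estimator depends on `k`: a larger `k` uses more memory and gives a tighter estimate, which is the kind of trade-off the speakers compare across approaches.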
Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Yahoo. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.
Nathan Speidel develops novel solutions to big data problems at Yahoo (Verizon Media Group) and works on the Audience Data ETL pipeline. He enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie, and Hive as well as new, cutting-edge open source tools like Bullet to push the limits of streaming data processing, visualization, querying, and transformation.