Bullet is a lightweight, scalable open source multitenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet can run arbitrary queries against an unbounded set of data that arrives after the query is submitted; Bullet queries look forward in time. These queries can filter, project, and aggregate data in transit. Bullet is also platform and framework agnostic. Almost all the layers in Bullet can be mixed and matched with different implementations using core abstractions such as Storm and Spark for the backend layer, Kafka or another messaging queue for the pub/sub layer, and so on.
Akshai Sarma and Michael Natkovich share their motivation for creating Bullet, detail its innovative architecture, and explain how sketches fit in. They then demonstrate the latest changes to Bullet on a real high-volume dataset at use in production and discuss how they dealt with the challenges of implementing intractable aggregations such as count distincts, finding top K items, or getting percentiles of an unknown distribution (such as the 99th percentile) and more on arbitrary streaming data.
Handling this challenge while also implementing various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining the results of these aggregations is a pretty hefty task. Throwing this challenge onto a system that operates with no persistence layer on arbitrary, very high-volume data streams in today’s IoT world seems like an impossible problem. Akshai and Michael share how they solved all this using sketches in a simple and elegant manner, comparing different approaches to show why they settled on using DataSketches and outlining their trade-offs.
Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Oath. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.
Nathan Speidel develops novel solutions to Big Data problems at Yahoo (Verizon Media Group). Working on the Audience Data ETL pipeline he enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie and Hive, as well as new, cutting-edge, open source tools like Bullet to push the limits of streaming data processing, visualization, querying and transformation.
Help us make this conference the best it can be for you. Have questions you'd like this speaker to address? Suggestions for issues that deserve extra attention? Feedback that you'd like to share with the speaker and other attendees?
Join the conversation here (requires login)
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com