Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Approximation data structures in streaming data processing

Debasish Ghosh (Lightbend)

1:50pm–2:30pm Wednesday, March 7, 2018

Data engineering and architecture, Streaming systems and real-time applications
Location: 230 A

Average rating:

(3.33, 3 ratings)

Download slides (PDF)

Who is this presentation for?

Architects, programmers, and CTOs

Prerequisite knowledge

A basic understanding of streaming data

What you'll learn

Learn how probabilistic data structures can help implement streaming data architectures
Understand the power of hashing and randomization to address the fast data problem

Description

Debasish Ghosh explores the role that approximation data structures (Bloom filtera, sketches, HyperLogLog, etc.) play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.

Topics include:

Handling unbounded data streams with approximation data structures: Deterministic algorithms and data structures that operate on bounded data often fail to work with unbounded streaming data. A class of approximation data structures and algorithms fits these use cases nicely. For instance, a Bloom filter is a space-efficient probabilistic data structure for membership queries that works with sublinear space at the expense of the possibility of reporting false positives for some cases. Similarly, a count min sketch offers a count of frequencies of events using sublinear space. The percentage of tolerable false positives is always a trade-off with the space savings that you achieve with this data structure. Debasish explains how these data structures use randomization and hashing techniques to implement real-time analytics use cases on streaming data.
Efficiency of approximation data structures: Debasish discusses the probabilistic bounds that these data structures offer and how they can be used to efficiently implement solutions for fast and streaming data architectures. These data structures solve a range of approximate aggregate queries in an efficient manner (e.g., count unique, membership, histogram and percentile, top k frequent items, range query, and many more). And all of these are extremely useful when implementing analytics solutions on unbounded streaming data.
A concrete use case: Debasish concludes with a concrete use case of a customized state store that uses Bloom filters on top of Kafka Streams to solve real-world problems.

Debasish Ghosh

Lightbend

Debasish Ghosh is a principal software engineer at Lightbend. He’s passionate about technology and open source, loves functional programming, and has been trying to learn math and machine learning. Debasish is an occasional speaker in technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He’s the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fan, who loves spending time with his beautiful family.

Website

Presented by

Elite Sponsors

Strategic Sponsors

Zettabyte Sponsor

Contributing Sponsors

Exabyte Sponsors

Impact Sponsors

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com