Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Approximation data structures in streaming data processing

Debasish Ghosh (Lightbend)
1:50pm2:30pm Wednesday, March 7, 2018
Average rating: ***..
(3.33, 3 ratings)

Who is this presentation for?

  • Architects, programmers, and CTOs

Prerequisite knowledge

  • A basic understanding of streaming data

What you'll learn

  • Learn how probabilistic data structures can help implement streaming data architectures
  • Understand the power of hashing and randomization to address the fast data problem


Debasish Ghosh explores the role that approximation data structures (Bloom filtera, sketches, HyperLogLog, etc.) play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.

Topics include:

  • Handling unbounded data streams with approximation data structures: Deterministic algorithms and data structures that operate on bounded data often fail to work with unbounded streaming data. A class of approximation data structures and algorithms fits these use cases nicely. For instance, a Bloom filter is a space-efficient probabilistic data structure for membership queries that works with sublinear space at the expense of the possibility of reporting false positives for some cases. Similarly, a count min sketch offers a count of frequencies of events using sublinear space. The percentage of tolerable false positives is always a trade-off with the space savings that you achieve with this data structure. Debasish explains how these data structures use randomization and hashing techniques to implement real-time analytics use cases on streaming data.
  • Efficiency of approximation data structures: Debasish discusses the probabilistic bounds that these data structures offer and how they can be used to efficiently implement solutions for fast and streaming data architectures. These data structures solve a range of approximate aggregate queries in an efficient manner (e.g., count unique, membership, histogram and percentile, top k frequent items, range query, and many more). And all of these are extremely useful when implementing analytics solutions on unbounded streaming data.
  • A concrete use case: Debasish concludes with a concrete use case of a customized state store that uses Bloom filters on top of Kafka Streams to solve real-world problems.
Photo of Debasish Ghosh

Debasish Ghosh


Debasish Ghosh is a principal software engineer at Lightbend. He’s passionate about technology and open source, loves functional programming, and has been trying to learn math and machine learning. Debasish is an occasional speaker in technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He’s the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fan, who loves spending time with his beautiful family.