Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Analyzing billions of users with Druid and Theta Sketches

Eric Tschetter (Yahoo)
11:00am–11:40am Wednesday, 03/30/2016
Data Innovations

Location: 210 D/H
Average rating: 4.29 (7 ratings)

Prerequisite knowledge

Attendees should have a general understanding of data modeling in databases.


Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch, the Theta Sketch, to enable this analysis. Eric Tschetter discusses how Yahoo uses Druid and Theta Sketches together to understand those users at the individual level.

Specifically, Druid assumes that data is summarized on ingest. Summarization is a well-known tool in the BI tool chest, but it introduces some loss of data fidelity because it throws away high-cardinality, “low-value” dimensions. These dimensions may be “low-value” in the sense that keeping them multiplies the number of rows in the dataset, but they can hold important information that is lost when they are thrown away. (This is most often seen with unique counts, such as the number of distinct users.)
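The effect of summarizing on ingest can be illustrated with a small sketch. This is plain Python, not Druid code; the event tuples, the `rollup` helper, and all values are made up for illustration. It drops the user-identifier dimension and counts events per (hour, country), which shrinks the row count but makes the number of unique users unrecoverable from the summary alone.

```python
from collections import Counter

# Hypothetical raw events: (hour, country, user_id). All values are invented.
events = [
    ("2016-03-30T11", "US", "u1"),
    ("2016-03-30T11", "US", "u2"),
    ("2016-03-30T11", "US", "u1"),
    ("2016-03-30T11", "JP", "u3"),
    ("2016-03-30T12", "US", "u2"),
]


def rollup(rows):
    """Drop the high-cardinality user_id and count events per (hour, country)."""
    return Counter((hour, country) for hour, country, _user in rows)


summary = rollup(events)

# Event counts survive summarization: 5 raw events become 3 summarized rows.
total_events = sum(summary.values())

# Unique users can only be computed from the raw rows; the summary no longer
# contains user_id, so this number is lost once the raw data is discarded.
true_uniques = len({user for _hour, _country, user in events})

print(len(summary), total_events, true_uniques)
```

The summary still answers "how many events", but nothing in it distinguishes one active user from many.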

To combat this problem, Yahoo developed Theta Sketches and integrated them into Druid. Theta Sketches make it possible to summarize away the user-identifier column while still answering questions about the number of unique users (set union), the number of users who did X and Y (set intersection), and the number of users who did X and did not do Y (set difference). The tradeoff for this functionality is a small, configurable amount of error in the resulting number.
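The idea behind the three set operations can be sketched in a few lines of Python. This is a toy, K-minimum-values style illustration written for this page, not Yahoo's implementation and not the Druid integration; the class name `ThetaSketch`, its methods, and the parameter `k` are all hypothetical. Each item is hashed to a number in [0, 1), only hashes below a threshold `theta` are retained, and the estimate is the retained count divided by `theta`. A larger `k` means lower error and more memory.

```python
import hashlib


class ThetaSketch:
    """Toy theta sketch: keeps at most k hash values that fall below theta."""

    def __init__(self, k=1024, theta=1.0, hashes=None):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes or ())

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _trim(self):
        # If more than k hashes are retained, lower theta to the (k+1)-th
        # smallest hash and keep only the k hashes below it.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def update(self, item):
        h = self._hash(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Retained hashes are a uniform sample of the distinct items with
        # sampling rate theta, so scale the count back up.
        return len(self.hashes) / self.theta

    def _combine(self, other, kept):
        theta = min(self.theta, other.theta)
        result = ThetaSketch(
            min(self.k, other.k), theta, (h for h in kept if h < theta)
        )
        result._trim()
        return result

    def union(self, other):
        return self._combine(other, self.hashes | other.hashes)

    def intersect(self, other):
        return self._combine(other, self.hashes & other.hashes)

    def a_not_b(self, other):
        return self._combine(other, self.hashes - other.hashes)


# Usage with made-up user IDs. With fewer than k items, theta stays at 1.0
# and the estimates are exact.
did_x = ThetaSketch()
did_y = ThetaSketch()
for user in ["a", "b", "c"]:
    did_x.update(user)
for user in ["b", "c", "d"]:
    did_y.update(user)

print(did_x.union(did_y).estimate())      # unique users: 4.0
print(did_x.intersect(did_y).estimate())  # did X and Y: 2.0
print(did_x.a_not_b(did_y).estimate())    # did X but not Y: 1.0
```

Once a sketch has seen more than `k` distinct items, `theta` drops below 1.0 and the results become estimates. In this toy version the relative error of a plain distinct count is roughly 1/sqrt(k), and intersections and differences are typically less accurate than unions.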

Eric introduces the idea of summarization in Druid, explains Theta Sketches, and describes how to leverage Theta Sketches inside of Druid.

Eric Tschetter


Eric Tschetter is the creator of and one of the main contributors to Druid, an open source, real-time analytical data store. Eric is currently a distinguished engineer at Yahoo, where he works on speeding up analytics with a mix of data science and traditional BI. Previously, he worked with diabetes data at the nonprofit Tidepool, was the VP of engineering and lead architect at Metamarkets, and held senior engineering positions at Ning and LinkedIn. He holds bachelor’s degrees in computer science and Japanese from the University of Texas at Austin and an MS in computer science from the University of Tokyo.