Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch, the Theta Sketch, to enable this analysis. Eric Tschetter discusses how Yahoo uses Druid and Theta Sketches together to build a user-level understanding of that user base.
Specifically, Druid assumes that data is summarized on ingest. Summarization is a well-known tool in the BI tool chest, but it introduces some loss of data fidelity because it throws away high-cardinality, “low-value” dimensions. While these dimensions might be “low-value” when judged by how much they expand the number of rows in the dataset, they may hold important information that is lost when the dimensions are thrown away. (This is most often seen with unique counts: once the user-identifier column is dropped, the number of distinct users can no longer be computed from the summarized rows.)
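The loss described above can be illustrated with a small sketch. This is plain Python, not Druid's ingestion API; the event tuples, column layout, and variable names are hypothetical and exist only to show what summarizing away a user-identifier column does to unique counts.

```python
from collections import Counter

# Hypothetical raw events: (hour, country, user_id).
raw_events = [
    ("2016-03-01T10", "US", "user_a"),
    ("2016-03-01T10", "US", "user_a"),
    ("2016-03-01T10", "US", "user_b"),
    ("2016-03-01T10", "JP", "user_c"),
    ("2016-03-01T11", "US", "user_a"),
]

# Summarize on ingest: drop user_id and keep one row per (hour, country)
# with an event count. Five events collapse into three rows.
rolled_up = Counter((hour, country) for hour, country, _user in raw_events)

# Distinct users are only recoverable from the raw events.
exact_uniques = len({user for _hour, _country, user in raw_events})

# Summing the summarized counts gives total events, not distinct users.
summed_counts = sum(rolled_up.values())

print(dict(rolled_up))
print("distinct users in raw data:", exact_uniques)
print("sum of rolled-up counts:", summed_counts)
```

The summarized rows are smaller, but nothing in them distinguishes three users generating five events from five users generating one event each.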
To combat this problem, Yahoo developed Theta Sketches and integrated them into Druid. Theta Sketches enable us to summarize away our user-identifier column while still being able to answer questions about the number of unique users (set union), the number of users who did X and Y (set intersection), and the number of users who did X and did not do Y (set difference). The tradeoff for this functionality is a small, configurable error in the resulting number.
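The three set operations above can be sketched with a toy, KMV-style sketch in pure Python. This is an assumption-laden illustration, not Yahoo's Theta Sketch implementation and not Druid's API: the class `ToyThetaSketch`, the helpers `union`, `intersection`, and `difference`, the SHA-256 based hash, and the default `k` are all hypothetical. The idea it shows is that each sketch keeps only the hashes below a threshold theta, and counts are scaled up by 1/theta. The parameter `k` plays the role of the configurable error: for this toy version the relative error of a union is on the order of 1/sqrt(k).

```python
import hashlib

MAX_HASH = 2 ** 64  # hashes are integers in [0, MAX_HASH)


def hash64(item):
    """Map an item to a roughly uniform 64-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Holds every seen hash below theta, capped at k entries."""

    def __init__(self, k=2048, theta=MAX_HASH, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes)
        self._trim()

    def _trim(self):
        # Evict the largest hashes until k remain. Theta drops to the last
        # evicted value, so the set still holds every seen hash below theta.
        while len(self.hashes) > self.k:
            self.theta = max(self.hashes)
            self.hashes.remove(self.theta)

    def update(self, item):
        h = hash64(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Retained hashes are a (theta / MAX_HASH) sample of distinct items.
        return len(self.hashes) * MAX_HASH / self.theta


def _combine(a, b, hashes):
    # All set operations compare the two sketches under the smaller theta.
    theta = min(a.theta, b.theta)
    kept = {h for h in hashes if h < theta}
    return ToyThetaSketch(min(a.k, b.k), theta, kept)


def union(a, b):
    return _combine(a, b, a.hashes | b.hashes)


def intersection(a, b):
    return _combine(a, b, a.hashes & b.hashes)


def difference(a, b):
    return _combine(a, b, a.hashes - b.hashes)


# Hypothetical data: users 0..29999 did X, users 20000..49999 did Y.
did_x = ToyThetaSketch()
did_y = ToyThetaSketch()
for user_id in range(0, 30000):
    did_x.update(f"user-{user_id}")
for user_id in range(20000, 50000):
    did_y.update(f"user-{user_id}")

unique_users = union(did_x, did_y).estimate()    # true value: 50000
x_and_y = intersection(did_x, did_y).estimate()  # true value: 10000
x_not_y = difference(did_x, did_y).estimate()    # true value: 20000

print(round(unique_users), round(x_and_y), round(x_not_y))
```

Each sketch stores at most `k` hashes regardless of how many users it has seen, which is what lets the user-identifier column be summarized away. Raising `k` lowers the error at the cost of a larger sketch.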
Eric introduces the idea of summarization in Druid, explains Theta Sketches, and describes how to leverage Theta Sketches inside of Druid.
Eric Tschetter is the creator of and one of the main contributors to Druid, an open source, real-time analytical data store. Eric is currently a distinguished engineer at Yahoo, where he works on speeding up analytics with a mix of data science and traditional BI. Previously, Eric worked with diabetes data at the nonprofit Tidepool, was the VP of engineering and lead architect at Metamarkets, and held senior engineering positions at Ning and LinkedIn. He holds bachelor’s degrees in computer science and Japanese from the University of Texas at Austin and an MS in computer science from the University of Tokyo.