Skip to main content

Not Exactly! Fast Queries via Approximation Algorithms

Fangjin Yang (Imply), Nelson Ray (Metamarkets)
Hardcore Data Science Gramercy Suite
Average rating: ****.
(4.00, 8 ratings)
Slides:   1-ZIP 

While exact queries sound like a good idea, they are not ideal in all cases. For some situations, approximation is more efficient and effective. Calculating the cardinality exactly of 1 billion unique integer-valued IDs requires 4GB of storage. Algorithms such as HyperLogLog enable estimates within 2% for cardinalities exceeding 1 billion using only 1.5KB of storage.

Data stored in Druid typically undergoes a timestamp truncation to a desired level of granularity (e.g. 1 hour) and then a GroupBy/aggregation step that summarizes data. Certain “distributive” statistics (count, sum, min, max) can be summarized and computed exactly. For non-distributive statistics such as cardinality and quantiles, we store lossy representations of the data. These sketches enable fast, approximate calculations of the desired statistics.

We describe our use of and modifications to HyperLogLog for cardinality estimation, streaming parallel decision trees for quantile estimation and histogram building, and approximate top-k algorithms.

Photo of Fangjin Yang

Fangjin Yang

Imply

One of Metamarkets’ first developers on Druid, their database platform, Fangjin is responsible for core infrastructure development including real-time data ingestion. He joined Metamarkets from Cisco where he optimized packet diagnostic algorithms for Cisco’s flagship Cat6k router. Previous to this Fangjin held various engineering and architecture roles at Ericsson and Barclay’s Capital. He holds a BASc in Electrical Engineering and a MASc in Computer Engineering from the University of Waterloo, Canada.

Photo of Nelson Ray

Nelson Ray

Metamarkets

Nelson Ray develops data-analytical algorithms as a software engineer at Metamarkets. He holds a B.S., M.S., and Ph.D. in statistics from Stanford University, where he wrote his thesis on the use of machine learning techniques for statistical inference. He has prior experience running large-scale, adaptive experiments at Facebook.

Sponsors

Sponsorship Opportunities

For exhibition and sponsorship opportunities, contact Susan Stewart at sstewart@oreilly.com

Media Partner Opportunities

For information on trade opportunities with O'Reilly conferences email mediapartners
@oreilly.com

Press & Media

For media-related inquiries, contact Maureen Jennings at maureen@oreilly.com

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts