If you’re monitoring a larger volume of data than you can collect and keep, there are generally two ways to describe the data compactly: generate metrics about it and keep the metrics, or take samples of the data. At VividCortex we do a blend of both. Part of our data is lots of queries arriving at database servers. We classify the queries and generate metrics about each class, and we take samples from each class, because it’s hard to optimize queries if you have no samples to inspect! Blending metrics and samples gives good insight. But for our use case, sampling turned out to be really hard to do well. The solution should be interesting to anyone who samples from a stream of events.
Our problem, broadly speaking, was how to sample the right amount of data from the blend of classes of queries, with the desired distribution in time. There are lots of edge cases that can cause various types of problems including our agents DDOS-ing our APIs. These include high cardinality, high velocity, outliers, and more. And there are lots of requirements, too, including avoiding floods, making sure we get samples, rate-limiting individual classes and the whole, and the like. Many of the obvious ideas are quite bad along one or more of these criterion. For example, capturing the “worst” sample in a given time period gives a badly skewed view of the stream, and it’s easy to find or construct a case where sampling consumes lots of memory and/or CPU. (Ask us how we know that.)
Our solution uses a sketch, which is a probabilistic data structure. Common examples include HyperLogLog, Count-Min, and Bloom Filters. In this talk we’ll explain how we created a specialized sketch to achieve our ideal sampling behavior. We’ll cover our goals, approaches we tried previously, their results and sometimes-dramatic failings, and the sketch and its results, as well as why and how it works. We will open-source the sketch’s code under the MIT license.
Baron Schwartz is the founder and CTO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software and several books, including High Performance MySQL. He’s focused his career on learning and teaching about performance and observability of systems generally, including the view that teams are systems and culture influences their performance, and databases specifically.
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com