Making Big Data Small

Baron Schwartz (VividCortex)
Data-Driven Business Beekman Parlor -- Sutton North

Today it almost seems fashionable to capture, store, and process “everything,” because we can. But there’s a real cost to this approach — and in many cases, the ultimate goal might be served nearly as well by a Small Data mindset and worldview.

In this session I will share my tricks for turning many problems that seem to call for a Big Data, Big Compute solution into comparatively small and cheap ones. The savings can be as big as you want them to be, including “infinite” (yes, with air-quotes) in some cases. Not every problem is amenable to this kind of solution, but many are.

In general, data collection, storage, retrieval, and processing can all be characterized by the cost and resources required for storage, bandwidth, and computation. Each of these often offers opportunities for a cost-versus-accuracy tradeoff. Consider Bloom Filters, for example, which answer a yes-no question with either “probably yes” or “definitely no” and are extremely cheap relative to the cost of a “definitely yes/no” answer.
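
As a rough illustration of that tradeoff (not taken from the talk itself), a Bloom filter can be sketched in a few lines of Python. The class name `BloomFilter`, the method `might_contain`, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for readability, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array plus k hash functions (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both sizes are arbitrary illustrative defaults, not recommendations.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely no"; True only means "probably yes".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)

print(seen.might_contain("alice"))  # True: added items are never reported absent
```

With the defaults above, the whole structure occupies 128 bytes however many items are added; the price is a false-positive rate that rises as the bit array fills up.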

If you’re not familiar with Bloom Filters, I’ll cover them, as well as a variety of other techniques, such as exponential moving averages, discarding strong correlates, pre-filtering, sparse collection and storage, histograms, statistical metrics, sampling, and modeling. Each of these offers a tradeoff that’s worth considering.
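
To make one of the techniques listed above concrete, here is a minimal sketch of an exponential moving average in Python. It is an illustration under stated assumptions rather than the speaker's method: the class name `ExponentialMovingAverage`, the `update` method, and the smoothing factor `alpha=0.5` are hypothetical choices.

```python
class ExponentialMovingAverage:
    """Running average that keeps one number instead of the full history."""

    def __init__(self, alpha=0.5):
        # alpha is an assumed smoothing factor: closer to 1 tracks recent
        # values more tightly, closer to 0 smooths more heavily.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            # The first observation seeds the average.
            self.value = float(x)
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


ema = ExponentialMovingAverage(alpha=0.5)
for sample in [10, 20, 30]:
    print(ema.update(sample))  # prints 10.0, then 15.0, then 22.5
```

The state is a single float per metric, so storage stays constant no matter how many samples arrive; what is given up is the ability to recompute an exact average over an arbitrary past window.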

In addition, I’ll share my general approach to finding Small Data solutions to all kinds of Big Data problems. I don’t have a fancy name for it, but I do have a process that works well for me, and I believe it may be useful to you too.

Baron Schwartz


Baron Schwartz is the founder and CTO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software and several books, including High Performance MySQL. He has focused his career on learning and teaching about the performance and observability of systems in general (including the view that teams are systems and that culture influences their performance) and of databases in particular.

Comments on this page are now closed.


Marek K Kolodziej
10/30/2013 4:36pm EDT

Would it be possible to post the slides here, like the other speakers have?

