Scaling by Cheating: Approximation, Sampling and Fault-friendliness for Scalable Big Learning

Data Science
Location: King's Suite - Balmoral
Level: Intermediate
Average rating: 3.17 (6 ratings)
Slides: 1-PPTX

Today, large-scale data analysis and learning face two large and
opposing demands: process ever more data, and process it ever
faster. Today’s gigabytes and seconds are tomorrow’s terabytes and
milliseconds. Fortunately, we have cheap computing resources and mature
frameworks like Hadoop, but we need a second “secret weapon” to keep
up: cheating.

Exact answers are not always required. At extreme scale, the time
wasted in finding exact rather than close-enough answers is also
extreme. “Cheating”, to get an approximate answer using far less time
and far fewer resources, becomes an appealing tool.

In this talk, we’ll show how simple computations, like finding an
average, can be greatly sped up by correctly deciding when the answer
is near enough. We’ll show how even Hadoop’s simple “Word Count”
program can run several times faster with almost no noticeable loss
of accuracy, given some careful application of this principle.
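
As a rough illustration of the "near enough" idea for an average, here
is a minimal Python sketch. It is not taken from the talk: the function
name approximate_mean, the 1% relative tolerance, the 95% confidence
multiplier and the batch size are all assumptions chosen for the
example. It reads values in random order, keeps a running mean and
variance, and stops as soon as the estimated confidence half-width is
small relative to the mean.

```python
import math
import random


def approximate_mean(values, rel_tol=0.01, z=1.96, batch=1000, seed=0):
    """Estimate the mean of `values` from a random sample, stopping early.

    Hypothetical helper for illustration only. Stops once the approximate
    confidence half-width (z * standard error) is at most
    rel_tol * |running mean|, or once every value has been read.
    Returns (estimate, number_of_values_read).
    """
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)  # random order, so any prefix is a uniform sample

    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    for i in order:
        x = values[i]
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Only test the stopping rule once per batch to keep it cheap.
        if n % batch == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            if z * std_err <= rel_tol * abs(mean):
                break
    return mean, n


# Usage on synthetic data (assumed: roughly normal values around 100).
data_rng = random.Random(42)
data = [data_rng.gauss(100, 15) for _ in range(200_000)]
estimate, read = approximate_mean(data)
exact = sum(data) / len(data)
print(f"estimate={estimate:.2f} exact={exact:.2f} read={read} of {len(data)}")
```

Shuffling an in-memory index is only practical for a sketch; at real
scale the sample would have to be drawn some other way, and checking
the stopping rule repeatedly makes the stated confidence level
approximate rather than exact.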

Finally, we’ll show how intelligent sampling makes infeasibly large
computations feasible in Mahout, and how a ‘fault-friendly’
distributed architecture using ML tools from Cloudera could gain
scale and simplicity by accepting a small error rate.

Sean Owen


Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics on Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.

Comments on this page are now closed.


Richard Zaresbki
28/11/2013 12:15 GMT

ignore my comment! antivirus problem. works now! cheers

Richard Zaresbki
28/11/2013 12:07 GMT

hi there, would it be possible to upload the slides again? they download but throw an error when trying to open. Regards Rich.

