Hilary Mason (Cloudera Fast Forward Labs)
This presentation will review common problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification, and describe the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.
Joseph Adler (Facebook)
Today, you can store terabytes of data for pennies per GB. But just because you can store it doesn't mean that you can do anything useful with it. In this talk, we'll look at when data size becomes a problem. When is it hard to keep data current and consistent? When is it hard to learn from the data? And when does data cause privacy and security concerns?
Michael Driscoll (Metamarkets)
The world is experiencing an Industrial Revolution of Data. In any given minute, the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. Increasingly, this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights.