This presentation reviews common data problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification. It describes the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.
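(As a hedged illustration of the kind of scalable algorithm the abstract alludes to for duplicate detection, below is a minimal sketch of standard MinHash in Python. It is not material from the talk itself, and the function names are illustrative. A MinHash signature lets you estimate document similarity without comparing full texts pairwise.)

```python
import hashlib

NUM_HASHES = 64  # number of hash functions; more gives a tighter similarity estimate

def shingles(text, k=5):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text):
    """One slot per seeded hash: the minimum hash value over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = "the quick brown fox jumps over the lazy dog near the riverbank today"
b = "the quick brown fox jumps over the lazy dog near the river bank today"
# Prints a Jaccard estimate: high for near-duplicates, near zero for unrelated text.
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

Because signatures are small and fixed-size, they can be bucketed (e.g., with locality-sensitive hashing) so that candidate duplicate pairs are found without an all-pairs scan, which is what makes the approach viable as data volume grows.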
Today, you can store terabytes of data for pennies per GB. But just because you can store it doesn't mean you can do anything useful with it. In this talk, we'll look at when data size becomes a problem: when is it hard to keep data current and consistent? When is it hard to learn from the data? And when does data raise privacy and security concerns?
The world is experiencing an Industrial Revolution of Data. In any given minute, the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And, increasingly, this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights.