This presentation will review and discuss common data problems encountered with web-sourced data, such as content cleaning, duplicate detection, clustering, and classification and describe the algorithms that work best as the volume of data increases, along with hacks for getting high-quality results as quickly as possible.
Hilary Mason is founder and CEO of Fast Forward Labs, a machine intelligence research company, and data scientist in residence at Accel Partners. Previously Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s home-grown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.