Make Data Work
Oct 15–17, 2014 • New York, NY

Three Approaches to Scalable Data Curation

11:00am–11:40am Thursday, 10/16/2014
Hadoop & Beyond
Location: 1 E20/1 E21

Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process. It involves:

  • Identifying data sources of interest (whether from inside or outside the enterprise)
  • Verifying the data (to ascertain its composition)
  • Cleaning the incoming data (for example, 99999 is not a legal zip code)
  • Transforming the data (for example, from European date format to US date format)
  • Integrating it with other data sources of interest (into a composite whole)
  • Deduplicating the resulting composite data set.
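
As a rough illustration only, the cleaning, transforming, integrating, and deduplicating steps above could be sketched in Python as follows. The record layout, the field names (`name`, `zip`, `date`), and the helpers `clean_zip`, `european_to_us_date`, and `curate` are all hypothetical and do not come from the talk; exact-match deduplication stands in for the far harder matching that real curation requires.

```python
from datetime import datetime


def clean_zip(zip_code):
    """Return a 5-digit zip code, or None for malformed values and the 99999 placeholder."""
    z = zip_code.strip()
    if len(z) != 5 or not z.isdigit() or z == "99999":
        return None
    return z


def european_to_us_date(text):
    """Convert a DD/MM/YYYY date string to MM/DD/YYYY."""
    return datetime.strptime(text, "%d/%m/%Y").strftime("%m/%d/%Y")


def curate(records):
    """Clean and transform each record, then drop exact duplicates."""
    seen = set()
    curated = []
    for rec in records:
        row = {
            "name": rec["name"].strip(),
            "zip": clean_zip(rec["zip"]),
            "date": european_to_us_date(rec["date"]),
        }
        # Hypothetical dedup key: case-insensitive name plus cleaned zip and date.
        key = (row["name"].lower(), row["zip"], row["date"])
        if key not in seen:
            seen.add(key)
            curated.append(row)
    return curated


# Two made-up sources, integrated by simple concatenation.
source_a = [{"name": "Acme Corp", "zip": "10001", "date": "16/10/2014"}]
source_b = [
    {"name": "acme corp ", "zip": "10001", "date": "16/10/2014"},
    {"name": "Globex", "zip": "99999", "date": "01/02/2014"},
]
result = curate(source_a + source_b)
```

Here the second "acme corp" record collapses into the first, and the Globex record keeps its row but loses its invalid zip code. Deciding whether two differently spelled records describe the same entity is where domain experts come in, which is the scaling problem the talk addresses.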

The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes – mostly because humans (domain experts, or data owners) aren’t scalable. As such, most enterprises are “tearing their hair out” as they try to cope with data curation at scale. We call this problem “Big Data Variety.”

This talk compares three approaches to Big Data Variety:

  • ETL (Extract-Transform-Load) tools
  • Data Science tools
  • Enterprise curation tools

Two case studies, one from an Information Services company and one from a Biopharmaceutical company, will showcase why the third approach to data curation at scale is the preferred option.


Michael Stonebraker

Tamr

Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who has been involved with Postgres, SciDB, Vertica, VoltDB, Tamr and other database companies. He co-authored the paper “Data Curation at Scale: The Data Tamer System,” presented at the Conference on Innovative Data Systems Research (CIDR’13).

Dr. Stonebraker specializes in database management systems and data integration, and has been a pioneer of database research and technology for more than a quarter of a century. He is the author of scores of papers in this area. He was the main architect of the INGRES relational DBMS; the object-relational DBMS, POSTGRES; and the federated data system, Mariposa; and has started nine start-up companies to commercialize these database technologies and, more recently, Big Data technologies (Vertica, VoltDB, Paradigm4, Tamr). He was recently elected to the National Academy of Engineering and the American Academy of Arts and Sciences.