Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Three Approaches to Scalable Data Curation

Michael Stonebraker (Tamr, Inc.)
4:00pm–4:40pm Thursday, 02/19/2015
Hadoop & Beyond
Location: 230 C

Data curation is the process of turning independently created enterprise data sources (structured and semi-structured data, and Hadoop data lakes) and public data sources (from the Internet of Things and elsewhere) into unified data sets ready for analytics, using domain experts to guide the process. It involves:

  • Identifying data sources of interest (whether from inside or outside the
    enterprise)
  • Verifying the data (to ascertain its composition)
  • Cleaning the incoming data (for example, 99999 is not a legal ZIP code)
  • Transforming the data (for example, from European date format to US date
    format)
  • Integrating it with other data sources of interest (into a composite whole)
    and deduplicating the resulting composite data set.
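
The cleaning, transforming, and deduplicating steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the tools discussed in the talk work: the record fields, the `INVALID_ZIPS` set, and the function names are all hypothetical, and the exact-match deduplication stands in for the expert-guided matching that curation at scale would need.

```python
from datetime import datetime

# Assumption: placeholder values that should not be accepted as ZIP codes.
INVALID_ZIPS = {"00000", "99999"}


def clean_zip(zip_code):
    """Return a 5-digit ZIP string, or None if the value is not plausible."""
    z = str(zip_code).strip()
    if len(z) != 5 or not z.isdigit() or z in INVALID_ZIPS:
        return None
    return z


def european_to_us_date(date_str):
    """Convert a DD/MM/YYYY date string to MM/DD/YYYY."""
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%m/%d/%Y")


def curate(records):
    """Clean, transform, and deduplicate a list of record dicts."""
    seen = set()
    result = []
    for rec in records:
        row = {
            "name": rec["name"].strip(),
            "zip": clean_zip(rec["zip"]),
            "date": european_to_us_date(rec["date"]),
        }
        # Naive dedup key: case-insensitive name plus the cleaned fields.
        key = (row["name"].lower(), row["zip"], row["date"])
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical input integrated from two sources.
records = [
    {"name": "Acme Corp", "zip": "95113", "date": "19/02/2015"},
    {"name": "acme corp ", "zip": "95113", "date": "19/02/2015"},
    {"name": "Globex", "zip": "99999", "date": "17/02/2015"},
]
curated = curate(records)
```

Here the two Acme rows collapse into one, and the Globex row keeps its place but carries `None` for the ZIP code, leaving the gap for a domain expert to resolve rather than silently dropping the record.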

The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes – mostly because humans (domain experts, or data owners) aren’t scalable. As such, most enterprises are “tearing their hair out” as they try to cope with data curation at scale. We call this big data problem one of “Big Data Variety.”

This talk compares three approaches to Big Data Variety:

  • Traditional ETL (Extract-Transform-Load) tools, performing “top-down” integration
  • Data Science tools, oriented toward individual data scientists
  • Enterprise curation tools, performing bottom-up curation

Three case studies – one from an Information Services company, one from a Biopharmaceutical company, and a third from a diversified conglomerate – will showcase why the third approach to data curation at scale is the preferred option.

Michael Stonebraker

Tamr, Inc.

Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who has been involved with Postgres, SciDB, Vertica, VoltDB, Tamr and other database companies. He co-authored the paper “Data Curation at Scale: The Data Tamer System” (https://cs.uwaterloo.ca/~ilyas/papers/StonebrakerCIDR2013.pdf), presented at the Conference on Innovative Data Systems Research (CIDR’13).

Dr. Stonebraker specializes in database management systems and data integration, and has been a pioneer of database research and technology for more than a quarter of a century. He is the author of scores of papers in this area. He was the main architect of the INGRES relational DBMS; the object-relational DBMS, Postgres; and the federated data system, Mariposa. He has founded nine start-up companies to commercialize these database technologies and, more recently, Big Data technologies (Vertica, VoltDB, Paradigm4, Tamr). He was recently elected to the National Academy of Engineering and the American Academy of Arts and Sciences.