Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, with domain experts guiding the process.
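As a rough illustration of what "unifying independently created data sources" can involve in practice, here is a minimal Python sketch. It is not from the talk or from any particular system; the data, field names, and matching rules are hypothetical. It maps two differently structured customer lists onto one schema and merges duplicate records:

```python
# Illustrative sketch only: unify two independently created customer lists
# into a single analytics-ready table. All data and field names are made up.

# Source A uses one schema...
source_a = [
    {"customer": "Acme Corp.", "phone": "617-555-0101"},
    {"customer": "Globex Inc", "phone": "617-555-0199"},
]

# ...and source B, created independently, uses another.
source_b = [
    {"name": "ACME CORP", "telephone": "(617) 555-0101"},
    {"name": "Initech", "telephone": "617-555-0150"},
]

def normalize(record, name_key, phone_key):
    """Map a record onto a unified schema with canonical name and phone formats."""
    name = (
        record[name_key].lower().rstrip(".")
        .replace(" inc", "").replace(" corp", "").strip()
    )
    phone = "".join(ch for ch in record[phone_key] if ch.isdigit())
    return {"name": name, "phone": phone}

unified = {}
records = [normalize(r, "customer", "phone") for r in source_a] + \
          [normalize(r, "name", "telephone") for r in source_b]
for rec in records:
    # Deduplicate on the normalized phone number; in a real curation workflow,
    # ambiguous matches would be routed to a domain expert for review.
    unified.setdefault(rec["phone"], rec)

print(list(unified.values()))
```

Even this toy version shows where the cost comes from: every new source needs its own mapping and cleaning rules, and the hard matching decisions still fall to people.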
The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes – mostly because humans (domain experts or data owners) aren’t scalable. As such, most enterprises are “tearing their hair out” as they try to cope with data curation at scale. We call this problem “Big Data Variety.”
This talk compares three approaches to tackling Big Data Variety.
Two case studies, one from an information services company and one from a biopharmaceutical company, will show why the third approach to data curation at scale is the preferred option.
Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who has been involved with Postgres, SciDB, Vertica, VoltDB, Tamr, and other database systems and companies. He co-authored the paper “Data Curation at Scale: The Data Tamer System,” presented at the Conference on Innovative Data Systems Research (CIDR’13).
Dr. Stonebraker specializes in database management systems and data integration, and has been a pioneer of database research and technology for more than a quarter of a century. He is the author of scores of papers in this area. He was the main architect of the INGRES relational DBMS, the object-relational DBMS POSTGRES, and the federated data system Mariposa, and has started nine start-up companies to commercialize these database technologies and, more recently, Big Data technologies (Vertica, VoltDB, Paradigm4, Tamr). He was recently elected to the National Academy of Engineering and the American Academy of Arts and Sciences.