Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Grounding big data: A meta-imperative

Joe Hellerstein (UC Berkeley), Vikram Sreekanti (Berkeley AMP Lab)
1:50pm–2:30pm Wednesday, 03/30/2016
Data Innovations

Location: 210 D/H
Average rating: 4.00 (7 ratings)

Prerequisite knowledge

Attendees should understand the general process of data analysis and the high-level components of the current generation of open source big data analytics tools, such as Hadoop, Spark, and Python.


Keep your eyes on the stars and your feet on the ground.
—Theodore Roosevelt

One of the key technical imperatives of our time is to harness the power of data to improve our world. Over the last decade, we have seen the potential of data for unlocking the genome, healing the sick, preventing fraud and terrorism, and helping people find love. There is ample justification for enthusiasm and pride.

At the same time, we must be introspective about our current limitations. Many data science projects are isolated wins, difficult to reproduce. Much of our software and data will not interoperate without ad hoc engineering efforts. The hard work people put into wrangling and analyzing data is often lost. In these early days of the data-driven era, our efforts to harness the power of data often remain inefficient and disjointed.

To follow through on the data imperative, data professionals need to “go meta,” systematically focusing our methodology, tools, and energy on our own processes. It is time to get serious about capturing, recording, and analyzing the work people do with data and computation and the contextual human knowledge they bring to those tasks. We are missing so much rich metadata in our projects: what data we have, why we have it, how and by whom it gets used, and how all these aspects evolve over time. Nourished by this metadata, we can more systematically grow reproducible processes that can be understood, reused, improved, and adapted over time.

One key barrier to this progress is the lack of open source software to enable this work. In an effort to fill this gap, our team at UC Berkeley is building an open, vendor-neutral Metadata Services layer we call Ground. At its most basic, Ground is a modest effort to enable lightweight capture of metadata via open APIs and basic formats. However, from this starting point, analysts and communities can go much deeper: nurturing and harvesting reference metadata, finding common ground for software and data interoperability, unearthing the history and lineage of all past versions of data and analytic processes, and many other metadata-driven improvements to data-centric work.

Joe Hellerstein and Vikram Sreekanti discuss the motivation and initial design of Ground through two reference use cases at UC Berkeley: reproducibility of genomics research driven by Spark and data-centric courseware captured in Jupyter Notebooks.

Joe Hellerstein

UC Berkeley

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Fellow, and the recipient of three ACM SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on its TR10 list of the 10 technologies most likely to change our world.

Vikram Sreekanti

Berkeley AMP Lab

Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley’s computer science department, he has served as a teaching assistant at Berkeley and an intern at Cloudera and Yammer.

Comments on this page are now closed.


Vikram Sreekanti
03/31/2016 3:46am PDT

You can find the slides here:

Martin Norgrove
03/30/2016 3:47pm PDT

When will the session slides be available?