Keep your eyes on the stars and your feet on the ground.
One of the key technical imperatives of our time is to harness the power of data to improve our world. Over the last decade, we have seen the potential of data for unlocking the genome, healing the sick, preventing fraud and terrorism, and helping people find love. There is ample justification for enthusiasm and pride.
At the same time, we must be introspective about our current limitations. Many data science projects are isolated wins, difficult to reproduce. Much of our software and data will not interoperate without ad hoc engineering efforts. The hard work people put into wrangling and analyzing data is often lost. In these early days of the data-driven era, our efforts to harness the power of data often remain inefficient and disjointed.
To follow through on the data imperative, we data professionals need to “go meta,” systematically focusing our methodology, tools, and energy on our own processes. It is time to get serious about capturing, recording, and analyzing the work people do with data and computation, along with the contextual human knowledge they bring to those tasks. We are missing so much rich metadata in our projects: what data we have, why we have it, how and by whom it gets used, and how all these aspects evolve over time. Nourished by this metadata, we can more systematically grow reproducible processes that can be understood, reused, improved, and adapted over time.
One key barrier to this progress is the lack of open source software to enable this work. In an effort to fill this gap, our team at UC Berkeley is building an open, vendor-neutral Metadata Services layer we call Ground. At its most basic, Ground is a modest effort to enable lightweight capture of metadata via open APIs and basic formats. However, from this starting point, analysts and communities can go much deeper: nurturing and harvesting reference metadata, finding common ground for software and data interoperability, unearthing the history and lineage of all past versions of data and analytic processes, and many other metadata-driven improvements to data-centric work.
Joe Hellerstein and Vikram Sreekanti discuss the motivation and initial design of Ground through two reference use cases at UC Berkeley: reproducibility of genomics research driven by Spark and data-centric courseware captured in Jupyter Notebooks.
Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.
Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley’s computer science department, he has served as a teaching assistant at Berkeley and an intern at Cloudera and Yammer.