Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Big data at a crossroads: Time to go meta (on use)

Joe Hellerstein (UC Berkeley)
11:20am–12:00pm Wednesday, 09/30/2015
Data Innovations
Location: 1 E18 / 1 E19
Level: Intermediate
Average rating: 4.22 (18 ratings)

In its early days, Hadoop was notable as a vibrant open source community, which rallied around a common architectural hypothesis and an open software ecosystem. More recently, Hadoop has evolved into a marketplace, supporting a growing variety of goods and services. As the Hadoop market scales, incentives shift toward increased differentiation, with various parties seeking to introduce unique value in the ecosystem. These incentives are generally positive forces, driving innovation both within the open source core and in value-added proprietary solutions. But the pressures of differentiation also bring risks of fragmentation, with the potential for isolated and incompatible competing components that could eventually stymie the growth of innovative uses of data.

To maintain a degree of cohesion and continue to incentivize innovation, the Hadoop ecosystem requires an agreed-upon medium for interoperation among software components and collaboration among users. In a data-rich environment, this medium is provided by metadata services. Over the past six months, we have witnessed rapidly growing community interest—from both non-profit and commercial parties—in a new generation of open metadata services for the Hadoop ecosystem.

Metadata services need to be significantly revisited in the evolving big data context, which is so often focused on agile analytics and structure-on-use. In 20th century enterprises, metadata was typically a deeply engineered artifact—the blueprint for an enterprise-wide data edifice, painstakingly designed in advance of construction to provide a “golden master” of enterprise truth. 20th century metadata management software was designed for this engineering philosophy and its associated waterfall engineering processes.

In modern big data systems, the lion’s share of metadata arises through agile work processes, and needs to be managed in new ways. It is true that some metadata will continue to be produced by design: this includes metadata for core system functionality, including security metadata about users, permissions, and access control. However, the bulk of valuable metadata in the modern context will be generated as metadata-on-use: emergent, contextual information that arises naturally when data is assessed and analyzed for use.

To get a sense of this distinction, consider a typical scenario of how metadata can be generated in the lifecycle of a modern data-driven business; similar scenarios arise in scientific applications. An innovative consumer electronics company decides to leverage usage logs from their devices to understand customer behavior, improve product usability, and offer new differentiated features to their customers. This effort begins with aggressive data wrangling: gathering raw logs across a variety of devices, teasing apart their structure and content, assessing and remediating data quality, and blending the multiple logs to enable analyses across users, devices, time, and geography. The details that are uncovered need to be logged, and will not have the shape and style of traditional data. For example, some metadata may describe rich detail about the data content. (“The clock for product XYZ gets reset on hardware restart, so timestamps should not be trusted until GPS turns on and sets the clock.”) Some metadata may describe usage and expertise. (“DJ Patil worked on this data at timestamp T; see file foo.ipynb on GitHub for ways to chart it.”) Some may describe data lineage, a topic of great interest to scientists interested in reproducibility of data-centric experiments. (“This data set was generated via an Oozie job described in file bar.xml, using data sources Q, R, and S.”) These are just a few examples of the breadth of rich metadata that emerges organically through usage. Note that while the examples above are described in prose, 21st century metadata will inevitably be a mixture of human-generated annotations and information generated by an expanding ecosystem of software tools.
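To make the flavor of these records concrete, here is a minimal sketch in Python of how the three annotations above might be captured in one flexible representation. The record shape and its field names are illustrative assumptions, not an existing schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical record shape: a deliberately schema-light envelope, so that
# content notes, usage notes, and lineage can share one representation.
@dataclass
class MetadataRecord:
    dataset: str    # the data asset the note describes
    kind: str       # e.g., "content", "usage", "lineage"
    body: dict      # free-form payload, human- or tool-generated
    author: str = "unknown"
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

notes = [
    MetadataRecord("xyz_device_logs", "content",
                   {"note": "Clock resets on hardware restart; distrust "
                            "timestamps until GPS sets the clock."}),
    MetadataRecord("xyz_device_logs", "usage",
                   {"worked_at": "T", "see": "foo.ipynb on GitHub"},
                   author="DJ Patil"),
    MetadataRecord("xyz_device_logs", "lineage",
                   {"job": "Oozie job in bar.xml", "sources": ["Q", "R", "S"]}),
]
```

The point of the free-form body is precisely that its contents cannot be fixed in advance; each tool or analyst contributes whatever context the moment of use reveals.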

Most importantly, note that some of the key content and structure of this metadata cannot be anticipated in advance: it is a product of the exploratory process of data analysis, and is inherently metadata-on-use. This is a very good thing: data is best understood by an organization when the data is actively being worked. Analysts know their data best when they are hip-deep in it, doing aggressive data wrangling and interactive, experimental analytics. Without breaking flow, the analyst—with assistance and automation from the software they work with—should be able to fluidly capture and store information about both the data and the manner and context in which it is being used.
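As a purely speculative sketch of what “without breaking flow” could look like from a notebook, consider a one-line client call to a metadata service. The annotate function, the endpoint path, and the service URL are all assumptions, not an existing API:

```python
import json
import urllib.request

# Hypothetical one-liner for in-flow capture: POST a free-form note about
# a dataset to a metadata service without leaving the analysis session.
def annotate(service_url: str, dataset: str, kind: str, body: dict) -> None:
    payload = json.dumps({"dataset": dataset, "kind": kind, "body": body}).encode()
    req = urllib.request.Request(f"{service_url}/annotations", data=payload,
                                 headers={"Content-Type": "application/json"},
                                 method="POST")
    urllib.request.urlopen(req)

# Called from a notebook cell, mid-wrangling (assumed service URL):
# annotate("http://metadata.example:8080", "xyz_device_logs", "content",
#          {"note": "timestamps unreliable before first GPS fix"})
```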

To support this fluid, unanticipated generation of knowledge, metadata services must be able to continuously support new users, new data sources, new types of metadata, and new software components. At the same time, they have to provide an environment in which people—and software—can add value over time: mining, culling, and organizing metadata in accordance with its utility, measured both in grass-roots terms (e.g., via frequency of use) and in strategic terms (e.g., stated value to the organization).
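To illustrate what combining those two measures might mean in practice, here is a toy scoring sketch. The weights, the log damping, and the numbers are illustrative assumptions, not a prescribed formula:

```python
import math

# Toy utility score blending grass-roots use with stated strategic value.
def utility(use_count: int, stated_value: float, w_use: float = 0.5) -> float:
    grass_roots = math.log1p(use_count)   # damp heavy-use outliers
    return w_use * grass_roots + (1 - w_use) * stated_value

# (use_count, stated_value in [0, 1]) for two hypothetical annotations:
annotations = {"clock_reset_note": (412, 0.9), "old_schema_doc": (3, 0.2)}
ranked = sorted(annotations, key=lambda k: utility(*annotations[k]), reverse=True)
print(ranked)  # ['clock_reset_note', 'old_schema_doc']
```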

In order to succeed in the Hadoop environment, a new metadata service needs to meet basic criteria of interoperability and openness suited to metadata-on-use. The most important of these criteria can be derived from previously successful systems like HDFS (see the sketch after this list):

  • It needs to be an open-source, vendor-neutral project.
  • It needs to provide a minimum of functionality and a maximum of flexibility, to leave opportunities for a broad range of unanticipated uses and value-added services.
  • It needs to scale out arbitrarily, both in volume and in workload; experience shows that metadata services can be big data problems in their own right.
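Read together, these criteria suggest a very small, schema-light surface that fixes almost nothing about metadata shape and leaves mining and curation to layers above it. The following interface is a speculative sketch of that idea; every name in it is illustrative:

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, Tuple

class MetadataStore(ABC):
    """Deliberately minimal: three operations, no fixed schema.

    Records are opaque dicts keyed by the entity they describe, so new
    users, data sources, metadata types, and tools require no schema
    change. An implementation would shard by entity to scale out in
    volume and workload, much as HDFS scales by block.
    """

    @abstractmethod
    def append(self, entity: str, record: dict) -> None:
        """Add a record about an entity; never update in place."""

    @abstractmethod
    def read(self, entity: str) -> Iterator[dict]:
        """Stream all records about one entity, oldest first."""

    @abstractmethod
    def scan(self, predicate: Callable[[dict], bool]) -> Iterator[Tuple[str, dict]]:
        """Full scan for mining and curation jobs, which may themselves be big data jobs."""
```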

In this talk, we will describe the need for metadata-on-use services in the big data context, and the reasons why various constituencies in the community benefit from “going meta” in an open way. We will illustrate the need for metadata services with use cases in both big science and enterprise deployments. Finally, we will lay out the design challenges and opportunities endemic to systems supporting metadata-on-use.

Joe Hellerstein

UC Berkeley

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Fellow, and the recipient of three ACM SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.

Comments on this page are now closed.


Joe Hellerstein
07/10/2015 4:05pm EDT

Apparently I was not aware of the overstrike formatting in this interface. I do indeed look forward to meeting up!

Joe Hellerstein
07/10/2015 4:04pm EDT

Joe - I look forward to meeting up. - Joe

Joe Witt
07/10/2015 6:54am EDT

This is an excellent writeup. Very much hope to make it to your talk. Some of these very concepts are built into Apache NiFi (incubating). At the upcoming OSCON I will be describing the power of data provenance and the cool user experience and system features it enables when you retain context combined with content. The talk will include a demo of provenance from a data source such as Twitter into both streaming and batch analytic systems in parallel. Whether the data is being joined or forked, it doesn’t matter – provenance allows you to retain the contextual trail.

That talk is linked here:

You can learn more about NiFi and its support for extremely powerful data provenance here: