Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Agile Data Profiling in the Big Data Era

Adam Silberstein (Trifacta), Joe Hellerstein (UC Berkeley)
11:30am–12:10pm Thursday, 02/19/2015
Data Science
Location: LL20 A
Average rating: 4.33 (12 ratings)

The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization.

In the Big Data era, most of these steps need to be revisited. First, “the numbers” are often not evident in the raw data; instead, data transformation tasks extract features from the raw data, and those features—which are often derived in an ad hoc way for specific analytics tasks—provide the inputs for profiling. Second, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used, and care needs to be taken that multi-hour batch jobs actually do useful work. Finally, the output of a single data profiling run is often only the beginning of an iterative process: based on a profile, the choice of features and transformations often needs to change.

In this talk we’ll cover technical challenges in making data profiling agile in the Big Data era. We’ll discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing, and sketching data, and the pros and cons of each approach for different profiling needs in a Big Data context. We’ll discuss considerations for using Hadoop technologies for data profiling, and some of the pitfalls from our experience working with both massive Internet services and end-user profiling tools. Finally, we’ll look at higher-level DSLs and visual interfaces that allow users to declare their needs effectively, scope the behavior of the underlying techniques, and assess the results of profiling.
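As a rough illustration of the sampling approach mentioned above, the following is a minimal sketch of one-pass reservoir sampling followed by a small profile computed on the sample. The abstract does not say which sampling method the speakers use, so the choice of reservoir sampling is an assumption, and the names `reservoir_sample` and `profile` are hypothetical helpers introduced only for this example.

```python
import random
import statistics
from typing import Dict, Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream in one pass.

    Hypothetical helper: the talk does not name a specific sampling method.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def profile(values: Sequence[float]) -> Dict[str, float]:
    """Compute a few basic descriptive statistics over a non-empty sample."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    # Profile a 1,000-row sample instead of all 1,000,000 values.
    sample = reservoir_sample(range(1_000_000), k=1_000, seed=42)
    print(profile(sample))
```

The sample fits in memory no matter how long the stream is, which is what makes this kind of profile cheap to recompute as features and transformations change. The trade-off is that statistics from the sample are estimates: rare values and extreme outliers in the full data can be missed.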

Adam Silberstein

Trifacta

Adam Silberstein is a lead software engineer at Trifacta. His main area of interest is large-scale data processing, including the batch processing and online serving spaces. His work has appeared in top database venues such as SIGMOD, VLDB, and ICDE. Prior to joining Trifacta, Adam was a Staff Software Engineer at LinkedIn and a Research Scientist at Yahoo! Research. He completed his PhD at Duke University in 2007.

Joe Hellerstein

UC Berkeley

Joseph M. Hellerstein is Chief Strategy Officer at Trifacta and Chancellor’s Professor of Computer Science at UC Berkeley. His work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune Magazine among the 50 smartest people in technology, and MIT Technology Review included his work on its TR10 list of the 10 technologies most likely to change our world.