Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Architecting immediacy: The design of a high-performance, portable wrangling engine

Joe Hellerstein (UC Berkeley), Seshadri Mahalingam (Trifacta)
1:50pm–2:30pm Thursday, 03/31/2016
Data Innovations

Location: LL21 E/F
Average rating: 4.00 (3 ratings)

Prerequisite knowledge

Attendees should understand the basic concepts of how data-processing engines like Hadoop, Spark, or SQL database systems work.

Description

Traditional data-transformation tools operate in a script-execute-check loop: transformations are scripted by dragging icons onto a canvas or by writing code, the resulting script is compiled and executed over a dataset, and the results are checked for acceptability. This slow, deliberate authoring process makes it hard for users to understand their data and transformations and to explore different ways to wrangle data into shape.

At Trifacta, we have seen that users find wrangling radically simpler and more effective when they can visualize and “play” with their data and get immediate feedback. To deliver this experience, we had to build transformation technology that executes at the speed at which users think.

Seshadri Mahalingam and Joe Hellerstein discuss their high-performance data-transformation engine, Photon, which provides immediacy to the data-wrangling experience. Seshadri and Joe demonstrate how to make the most of modern processors, including their multiple cores and vector-processing capabilities, and emphasize issues that are specific to data wrangling, including heavy string manipulation, data profiling, and second-order transformations. They’ll talk about the surprising portability of C++ and LLVM and the ways they leverage those traditional technologies both in the browser and on the desktop, as well as the potential for open data interchange with other high-performance software like Impala and Drill.

Joe Hellerstein

UC Berkeley

Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.

Seshadri Mahalingam

Trifacta

Seshadri Mahalingam is a software engineer at Trifacta, where, in addition to building out Wrangle, Trifacta’s domain-specific language for expressing data transformation, he develops the low-latency compute framework that powers Trifacta’s fluid and immersive data wrangling experience. Seshadri holds a BS in EECS from UC Berkeley, where he cotaught a class on open source software.