Traditional data-transformation tools operate in a script-execute-check loop: users script transformations by dragging icons onto a canvas or by writing code, compile and execute the resulting script over a dataset, and then check the results for acceptability. This deliberate authoring process makes it very hard for users to understand their data and transformations and to explore different ways of wrangling data into shape.
At Trifacta, we have seen that users find wrangling radically simpler and more effective when they can visualize and “play” with their data and get immediate feedback. To deliver this experience, we had to build transformation technology that executes at the speed at which users think.
Seshadri Mahalingam and Joe Hellerstein discuss Photon, their high-performance data-transformation engine, which brings immediacy to the data-wrangling experience. Seshadri and Joe demonstrate how to make the most of modern processors, including their multiple cores and vector-processing capabilities, and emphasize issues that are specific to data wrangling, including heavy string manipulation, data profiling, and second-order transformations. They also discuss the surprising portability of C++ and LLVM and the ways they leverage those traditional technologies both in the browser and on the desktop, as well as the potential for open data interchange with other high-performance software like Impala and Drill.
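To make the data-profiling and multicore themes concrete, here is a minimal C++ sketch of a single-pass profile over a column of strings. It is an illustration under stated assumptions, not Photon's implementation: the names `ColumnProfile`, `looks_like_integer`, `profile_chunk`, and `merge` are hypothetical, and the statistics collected are arbitrary examples. The point is that partial profiles computed over separate chunks of rows can be combined afterwards, which is one way such work could be spread across cores.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of one string column (or of one chunk of its rows).
struct ColumnProfile {
    std::size_t rows = 0;        // values seen
    std::size_t empty = 0;       // empty strings
    std::size_t integers = 0;    // values made only of decimal digits
    std::size_t max_length = 0;  // longest value, in bytes
};

// True when the value is non-empty and consists solely of decimal digits.
bool looks_like_integer(const std::string& value) {
    return !value.empty() &&
           std::all_of(value.begin(), value.end(), [](unsigned char c) {
               return std::isdigit(c) != 0;
           });
}

// Profile rows [begin, end) of a column in a single pass.
ColumnProfile profile_chunk(const std::vector<std::string>& column,
                            std::size_t begin, std::size_t end) {
    ColumnProfile p;
    for (std::size_t i = begin; i < end && i < column.size(); ++i) {
        const std::string& v = column[i];
        ++p.rows;
        if (v.empty()) ++p.empty;
        if (looks_like_integer(v)) ++p.integers;
        p.max_length = std::max(p.max_length, v.size());
    }
    return p;
}

// Combine two partial profiles. The operation is associative, so chunks
// could be profiled independently (for example, one per core) and merged.
ColumnProfile merge(const ColumnProfile& a, const ColumnProfile& b) {
    ColumnProfile m;
    m.rows = a.rows + b.rows;
    m.empty = a.empty + b.empty;
    m.integers = a.integers + b.integers;
    m.max_length = std::max(a.max_length, b.max_length);
    return m;
}
```

Profiling the whole column with `profile_chunk(column, 0, column.size())` gives the same result as merging the profiles of any split of its rows, so the chunk boundaries can be chosen to suit the hardware rather than the data.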
Joseph M. Hellerstein is the Jim Gray Chair of Computer Science at UC Berkeley and cofounder and CSO at Trifacta. Joe’s work focuses on data-centric systems and the way they drive computing. He is an ACM fellow, an Alfred P. Sloan fellow, and the recipient of three ACM-SIGMOD Test of Time awards for his research. He has been listed by Fortune among the 50 smartest people in technology, and MIT Technology Review included his work on their TR10 list of the 10 technologies most likely to change our world.
Seshadri Mahalingam is a software engineer at Trifacta, where, in addition to building out Wrangle, Trifacta’s domain-specific language for expressing data transformation, he develops the low-latency compute framework that powers Trifacta’s fluid and immersive data-wrangling experience. Seshadri holds a BS in EECS from UC Berkeley, where he cotaught a class on open source software.