Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Weld: An optimizing runtime for high-performance data analytics

Shoumik Palkar (Stanford University)
11:50am–12:30pm Thursday, March 16, 2017

What you'll learn

  • Explore Weld, an optimizing runtime for data-intensive applications that works across disjoint libraries and functions

Description

Data analytics applications often run 10x below peak hardware performance because they combine multiple functions from different libraries and frameworks to build increasingly complex workflows. Even if each individual function is optimized in isolation, the cost of moving data between these functions can cause order-of-magnitude slowdowns. For example, even though the TensorFlow machine-learning library uses highly tuned linear algebra functions for each of its operators, workflows that combine these operators can be 16x slower than hand-tuned code. Similarly, workflows that perform relational processing in Spark SQL or pandas, numerical processing in NumPy, or a combination of these tasks spend most of their time moving data between processing functions and could run between 2x and 30x faster if optimized end to end.
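The data-movement cost described above can be illustrated with a minimal sketch. This is not Weld code: the functions scale, shift, and total are hypothetical stand-ins for routines from separate libraries, and plain Python lists stand in for their data structures. The unfused pipeline makes three passes over the data and materializes two intermediate lists, while the hand-fused version computes the same result in a single pass.

```python
# Hypothetical illustration of cross-function data movement; not Weld code.

def scale(xs, factor):
    """Stand-in for a 'library A' routine: one full pass, returns a new list."""
    return [x * factor for x in xs]


def shift(xs, offset):
    """Stand-in for a 'library B' routine: another pass, another new list."""
    return [x + offset for x in xs]


def total(xs):
    """Stand-in for a 'library C' reduction: a third pass over the data."""
    return sum(xs)


def pipeline_unfused(xs):
    # Each call is fine in isolation, but together they make three passes
    # and materialize two intermediate lists.
    return total(shift(scale(xs, 2.0), 1.0))


def pipeline_fused(xs):
    # The same computation written end to end: one pass, no intermediates.
    acc = 0.0
    for x in xs:
        acc += x * 2.0 + 1.0
    return acc


data = [float(i) for i in range(1000)]
unfused_result = pipeline_unfused(data)
fused_result = pipeline_fused(data)
```

Both versions return the same value; the difference is only in how many times the data is traversed and copied, which is the overhead an end-to-end optimizer aims to remove.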

Shoumik Palkar offers an overview of Weld, an optimizing runtime for data-intensive applications that works across disjoint libraries and functions. Weld uses a common representation to capture the structure of diverse data-parallel workloads, such as SQL, machine learning, and graph analytics, and then optimizes across them using a cost-based optimizer that takes hardware characteristics into account. Weld can be integrated into a variety of widely used analytics frameworks, such as Spark SQL for relational processing, TensorFlow for machine learning, and pandas and NumPy for general data science workloads. Integrating Weld with these frameworks requires no changes to user application code. Shoumik demonstrates how Weld speeds up existing workloads in these frameworks by up to 16x and can also enable speed-ups of two orders of magnitude in applications that combine them.
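The idea of a shared representation that lets a runtime optimize across libraries can be sketched as a toy example. Everything here is an assumption for illustration: the Lazy class and the lib_a_scale and lib_b_shift functions are hypothetical and do not reflect Weld's actual representation, API, or cost-based optimizer. The sketch only shows the general pattern: library functions record their work instead of running it, and a single evaluation step executes the recorded steps in one pass.

```python
# Toy sketch of deferred, cross-library evaluation; not Weld's API.

class Lazy:
    """Hypothetical deferred vector: records element-wise steps, runs none."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # Record the step and return a new deferred value; no data is touched.
        return Lazy(self.source, self.steps + (fn,))

    def evaluate_sum(self):
        # Run all recorded steps for each element in a single pass,
        # without materializing intermediate collections.
        acc = 0.0
        for x in self.source:
            for fn in self.steps:
                x = fn(x)
            acc += x
        return acc


def lib_a_scale(v, factor):
    """Stand-in for one library's operator, expressed on the shared type."""
    return v.map(lambda x: x * factor)


def lib_b_shift(v, offset):
    """Stand-in for a different library's operator on the same shared type."""
    return v.map(lambda x: x + offset)


# User code calls the two "libraries" as usual; work is deferred until the end.
result = lib_b_shift(lib_a_scale(Lazy([1.0, 2.0, 3.0]), 2.0), 1.0)
fused_total = result.evaluate_sum()
```

Because both hypothetical libraries build on the same deferred type, the calling code looks unchanged while the evaluation step sees the whole workflow at once. A real optimizer would do far more than this, for example choosing strategies based on hardware characteristics.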

Weld is planned to be open-sourced in the near future.


Shoumik Palkar

Stanford University

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, where he works with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.