Data analytics applications often run 10x below peak hardware performance because they combine multiple functions from different libraries and frameworks to build increasingly complex workflows. Even if each individual function is optimized in isolation, the cost of moving data between these functions can cause order-of-magnitude slowdowns. For example, even though the TensorFlow machine-learning library uses highly tuned linear algebra functions for each of its operators, workflows that combine these operators can be 16x slower than hand-tuned code. Similarly, workflows that perform relational processing in Spark SQL or pandas, numerical processing in NumPy, or a combination of these tasks spend most of their time moving data between processing functions and could run between 2x and 30x faster if optimized end to end.
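The data-movement cost described above can be illustrated with a small sketch. It is not taken from the talk: the functions `scale`, `shift`, and `total` are hypothetical stand-ins for separately optimized library calls, written in plain Python so the example is self-contained. The point is only the shape of the work, which is three full passes with two intermediate lists versus one pass with none.

```python
# Illustrative only: hypothetical stand-ins for independently optimized
# library functions. None of these names come from the talk.

def scale(values, factor):
    # Full pass over the input; materializes a new list of the same size.
    return [v * factor for v in values]

def shift(values, offset):
    # Second full pass; materializes another intermediate list.
    return [v + offset for v in values]

def total(values):
    # Third full pass; reduces the last intermediate to a single number.
    return sum(values)

def unfused(values):
    # Composing the functions: three passes, two intermediate lists.
    return total(shift(scale(values, 2.0), 1.0))

def fused(values):
    # The same computation optimized end to end: one pass, no intermediates.
    acc = 0.0
    for v in values:
        acc += v * 2.0 + 1.0
    return acc

data = [float(i) for i in range(1000)]
unfused_result = unfused(data)
fused_result = fused(data)
```

Both versions compute the same value. In a real workload the intermediates would be large arrays written to and read back from memory, which is where the time goes.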
Shoumik Palkar offers an overview of Weld, an optimizing runtime for data-intensive applications that works across disjoint libraries and functions. Weld uses a common representation to capture the structure of diverse data-parallel workloads, such as SQL, machine learning, and graph analytics, and then optimizes across them using a cost-based optimizer that takes hardware characteristics into account. Weld can be integrated into a variety of widely used analytics frameworks, such as Spark SQL for relational processing, TensorFlow for machine learning, and pandas and NumPy for general data science workloads. Integrating Weld with these frameworks requires no changes to user application code. Shoumik demonstrates how Weld speeds up existing workloads in these frameworks by up to 16x and can also enable speed-ups of two orders of magnitude in applications that combine them.
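As a rough intuition for how a common representation lets work be optimized across library boundaries, the toy below has two hypothetical "library" functions record their steps in a shared plan instead of executing them, and then runs the combined plan in a single pass. This is an assumption-laden sketch: `LazyPipeline`, `relational_filter`, and `numeric_square` are invented for illustration and are not Weld's API or its representation, and there is no cost-based optimization here.

```python
# Toy sketch only: a hypothetical shared plan, not Weld's actual API.

class LazyPipeline:
    """Records element-wise steps instead of running them immediately."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # Append a transformation to the plan; nothing is computed yet.
        return LazyPipeline(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Append a predicate to the plan; nothing is computed yet.
        return LazyPipeline(self.source, self.steps + (("filter", pred),))

    def sum(self):
        # Evaluate the whole recorded plan in one pass over the source.
        total = 0
        for item in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                total += item
        return total

# Hypothetical functions from two different "libraries" that both
# contribute to the same plan rather than materializing results.
def relational_filter(pipeline):
    return pipeline.filter(lambda x: x % 2 == 0)

def numeric_square(pipeline):
    return pipeline.map(lambda x: x * x)

plan = numeric_square(relational_filter(LazyPipeline(range(10))))
result = plan.sum()
```

The caller's code looks like ordinary function composition, yet no intermediate collection is built between the filter and the square.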
Weld is planned to be open-sourced in the near future.
Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, where he works with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.