Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Weld: Accelerating data science by 100x

Shoumik Palkar (Stanford University), Matei Zaharia (Stanford University)
4:35pm–5:15pm Wednesday, September 27, 2017
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 08/10
Level: Intermediate
Secondary topics: PyData
Average rating: 5.00 (2 ratings)

Who is this presentation for?

  • Data scientists

Prerequisite knowledge

  • Familiarity with Python and the PyData stack
  • Experience doing data science

What you'll learn

  • Explore Weld, a new open source project from Stanford to accelerate data-intensive applications by as much as 100x

Description

Weld is a new open source project from Stanford that accelerates data-intensive applications by as much as 100x. It optimizes across functions within a single library as well as across different libraries, so developers can write modular code and still get close to bare-metal performance without incurring expensive data-movement costs. Weld uses a common representation to capture the structure of data-parallel workloads such as SQL, machine learning, and graph analytics, and then optimizes across them using a cost-based optimizer that takes hardware characteristics into account. Weld provides APIs in Python and C and can be integrated into a variety of widely used analytics frameworks such as Spark SQL, TensorFlow, and pandas.
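
The data-movement cost mentioned above can be illustrated with a minimal sketch in plain Python. It is not Weld's API or its intermediate representation: `LazyVector`, `eager_pipeline`, and `fused_pipeline` are hypothetical names chosen for this example. The eager version materializes a full intermediate list after every step, while the lazy version only records each element-wise operation and applies all of them in a single pass when a result is requested.

```python
from typing import Callable, List, Optional


class LazyVector:
    """Hypothetical lazy vector (not part of Weld): records element-wise
    operations instead of running them immediately."""

    def __init__(self, data: List[float],
                 ops: Optional[List[Callable[[float], float]]] = None):
        self.data = data
        self.ops = list(ops or [])

    def map(self, fn: Callable[[float], float]) -> "LazyVector":
        # Record the operation; no data is touched yet.
        return LazyVector(self.data, self.ops + [fn])

    def sum(self) -> float:
        # One fused pass: every recorded operation is applied to each
        # element in turn, so no intermediate list is ever built.
        total = 0.0
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            total += x
        return total


def eager_pipeline(data: List[float]) -> float:
    # Each step writes a complete intermediate list before the next begins.
    doubled = [x * 2.0 for x in data]
    shifted = [x + 1.0 for x in doubled]
    return sum(shifted)


def fused_pipeline(data: List[float]) -> float:
    # Same computation, expressed lazily and evaluated in one loop.
    return LazyVector(data).map(lambda x: x * 2.0).map(lambda x: x + 1.0).sum()


values = [1.0, 2.0, 3.0, 4.0]
assert eager_pipeline(values) == fused_pipeline(values)
```

Both pipelines compute the same result; the difference is only in how many times the data is traversed and stored. A real optimizer of the kind the abstract describes would additionally generate native code and account for hardware characteristics, which this sketch does not attempt.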

Shoumik Palkar and Matei Zaharia offer an overview of Weld’s architecture and internals and demonstrate how Weld can be integrated into these libraries incrementally: only the most impactful operators are ported first, without breaking compatibility with the library’s other operators and without changing the library’s API, so users do not need to change their application code. Shoumik and Matei also explain how Weld speeds up existing workloads that use libraries such as pandas, NumPy, and Spark SQL by up to 300x when workloads combine these libraries, and they explore Grizzly, a Weld-integrated version of the pandas framework.
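
One way to picture the incremental integration described above is a wrapper in which a few operators have been ported to lazy evaluation while every other operator falls back to the original eager implementation. The sketch below is an assumption-laden illustration in plain Python, not Grizzly's or Weld's actual design: `LazySeries`, `add`, `mul`, `sorted_desc`, and `evaluate` are hypothetical names.

```python
class LazySeries:
    """Hypothetical series wrapper (not Grizzly's API) showing how ported
    and unported operators might coexist behind one interface."""

    def __init__(self, values, pending=()):
        self._values = list(values)
        self._pending = tuple(pending)  # recorded element-wise operations

    # Ported operators: record work instead of executing it.
    def add(self, k):
        return LazySeries(self._values, self._pending + ((lambda x: x + k),))

    def mul(self, k):
        return LazySeries(self._values, self._pending + ((lambda x: x * k),))

    def evaluate(self):
        # Apply all recorded operations in a single pass over the data.
        out = []
        for x in self._values:
            for fn in self._pending:
                x = fn(x)
            out.append(x)
        return out

    # Unported operator: force evaluation, then reuse the existing eager
    # implementation (here, Python's built-in sorted).
    def sorted_desc(self):
        return LazySeries(sorted(self.evaluate(), reverse=True))


result = LazySeries([3, 1, 2]).add(1).mul(10).sorted_desc().evaluate()
assert result == [40, 30, 20]
```

Callers chain the same methods whether or not an operator has been ported, which is the property that lets application code stay unchanged while more operators move to the lazy path over time.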

Shoumik Palkar

Stanford University

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, where he works with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Matei Zaharia

Stanford University

Matei Zaharia is an assistant professor in the Computer Science Department at Stanford, where he works on computer systems and big data.