Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

An adaptive execution mode for Spark SQL

Carson Wang (Intel), Yucai Yu (Intel)
12:05pm–12:45pm Thursday, December 7, 2017

Who is this presentation for?

  • Anyone interested in Spark SQL and SQL performance tuning

Prerequisite knowledge

  • Basic knowledge of or experience using Spark SQL

What you'll learn

  • Explore the internals of Spark SQL's engine and a new adaptive execution mode

Description

Spark SQL is one of the most popular components of Apache Spark. At its core is the Catalyst optimizer, which provides both rule-based and cost-based optimization. The quality of the SQL execution plan is an important factor in Spark SQL performance, but it is not easy to get an optimal execution plan at the planning phase. For example, a join operator may take intermediate results as its input tables, and the sizes of those results are not known until they have been computed. Spark may therefore choose a sort-merge join because it doesn’t know the table size at the planning phase, even though at runtime the table turns out to be small enough to broadcast.
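To make the planning-time limitation concrete, here is a minimal toy model in plain Python. It is not Spark's planner code: the function `choose_join` and the strategy names are hypothetical, and the threshold is only assumed to mirror the 10 MB default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting.

```python
from typing import Optional

# Assumed threshold for illustration (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_bytes: Optional[int],
                right_bytes: Optional[int],
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from the input sizes that are currently known.

    A size of None means "unknown", as for an intermediate result that
    has not been computed yet.
    """
    known = [size for size in (left_bytes, right_bytes) if size is not None]
    if known and min(known) <= threshold:
        return "broadcast_hash_join"
    # With no evidence that either side is small, fall back to the safe choice.
    return "sort_merge_join"


GB = 1024 ** 3
MB = 1024 ** 2

# Planning phase: the left input is an intermediate result of unknown size.
plan_time = choose_join(None, 500 * GB)

# Runtime: the intermediate result has been produced and is only 2 MB.
run_time = choose_join(2 * MB, 500 * GB)

print(plan_time, run_time)
```

The same function gives a different answer once the real size is available, which is the gap a runtime re-planning step is meant to close.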

Carson Wang and Yucai Yu explore Intel’s efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL, which can switch to an alternative execution plan at runtime. The adaptive execution mode can change a shuffle join to a broadcast join if it finds that one table is smaller than the broadcast threshold. It can also handle skewed input data for joins and change the partition number of the next stage to better fit the data scale. In general, adaptive execution reduces the effort involved in tuning SQL query parameters and improves execution performance by choosing a better execution plan and degree of parallelism at runtime. Carson and Yucai cover the internals of Spark SQL’s engine and the design considerations and share their experience running a 100 TB-scale TPCx-BB benchmark with it.
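The two partition-level adjustments mentioned above, fitting the partition number to the data and spotting skew, can be sketched as a toy model in plain Python. This is an illustration under stated assumptions, not the implementation presented in the talk: the function names, the 64 MB target, and the "5x the median" skew rule are all hypothetical choices.

```python
from statistics import median
from typing import List

MB = 1024 * 1024


def coalesce_partitions(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily merge adjacent shuffle partitions up to about target_bytes each.

    Returns groups of original partition indices; each group would be read
    by one task in the next stage.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int], factor: float = 5.0,
                      min_bytes: int = 64 * MB) -> List[int]:
    """Flag partitions much larger than the median (hypothetical rule)."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > min_bytes]


# Map-output sizes per shuffle partition, as observed after the stage ran.
observed = [s * MB for s in (10, 20, 30, 5, 700, 15, 25)]

groups = coalesce_partitions(observed, target_bytes=64 * MB)
skewed = skewed_partitions(observed)

print(groups)   # seven partitions collapse into four reader tasks
print(skewed)   # the 700 MB partition stands out
```

Both decisions rely only on sizes observed after the previous stage finished, which is why they cannot be made at the planning phase. A flagged partition would then need separate handling, for example being split across several tasks.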

Carson Wang

Intel

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

Yucai Yu

Intel

Yucai Yu is a software architect at Intel, where he works on Apache Spark upstream development and IA optimization. Previously, he worked at IBM and Citi Bank with a focus on OS, virtualization, storage, and data warehouses.