Spark SQL is one of the most popular components of Apache Spark. At its core is the Catalyst optimizer, which provides both rule-based and cost-based optimization. The quality of the execution plan is an important factor in Spark SQL performance, but an optimal plan is hard to obtain at the planning phase. For example, a join operator may take intermediate results as input tables, and their sizes are not known until runtime. Spark may therefore choose a sort-merge join because it doesn’t know the table size at the planning phase, even though at runtime the table turns out to be small enough to broadcast.
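The planning-time limitation described above can be illustrated with a minimal Python sketch. It is not Spark's or Intel's implementation: the function name `choose_join` and the strategy labels are hypothetical, and the 10 MB threshold is an assumption that mirrors the documented default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting.

```python
from typing import Optional

# Assumption: 10 MB, matching the documented default of
# spark.sql.autoBroadcastJoinThreshold.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_bytes: Optional[int],
                right_bytes: Optional[int],
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from whatever size information is available.

    A size of None means "unknown", which is the typical situation at
    planning time when an input is an intermediate result.
    """
    known_sizes = [s for s in (left_bytes, right_bytes) if s is not None]
    # Broadcast only when at least one side is known to be below the threshold.
    if any(size < threshold for size in known_sizes):
        return "broadcast_hash_join"
    # With no usable size information, fall back to the safe default.
    return "sort_merge_join"


# Planning phase: the right side is an intermediate result of unknown size.
planned = choose_join(left_bytes=50 * 1024**3, right_bytes=None)

# Runtime: the same input has been materialized and measured at 2 MB.
replanned = choose_join(left_bytes=50 * 1024**3, right_bytes=2 * 1024**2)
```

In this sketch the same query is planned as a sort-merge join and then re-planned as a broadcast join once the real size is observed, which is the gap adaptive execution aims to close.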
Carson Wang and Yucai Yu explore Intel’s efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL, which can switch to an alternative execution plan at runtime. The adaptive execution mode can change a shuffle join to a broadcast join if it finds that one table is smaller than the broadcast threshold. It can also handle skewed input data for joins and change the partition number of the next stage to better fit the data scale. In general, adaptive execution reduces the effort involved in tuning SQL query parameters and improves execution performance by choosing a better execution plan and degree of parallelism at runtime. Carson and Yucai cover the internals of the Spark SQL engine and the design considerations and share their experience running a 100 TB-scale TPCx-BB benchmark with it.
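As a rough illustration of adjusting the partition number and spotting skew from runtime statistics, the sketch below works on a plain list of per-partition shuffle sizes. It is an assumption-laden toy rather than the speakers' design: `coalesce_partitions`, `skewed_partitions`, the target size, and the skew factor of 5 are all hypothetical names and values chosen for the example.

```python
from statistics import median
from typing import List


def coalesce_partitions(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group adjacent shuffle partitions so each group stays near target_bytes.

    Returns lists of partition indices; each list would be read by one
    task in the next stage. A single oversized partition forms its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int], factor: float = 5.0) -> List[int]:
    """Return indices of partitions much larger than the median partition."""
    if not sizes:
        return []
    typical = median(sizes)
    return [i for i, size in enumerate(sizes) if size > factor * typical]


# Hypothetical per-partition sizes (in MB) observed after a shuffle stage.
observed = [10, 20, 30, 200, 5, 5]
groups = coalesce_partitions(observed, target_bytes=64)
skewed = skewed_partitions(observed)
```

Here six shuffle partitions collapse into three tasks, and partition 3 is flagged as skewed so that it could be handled separately, for example by splitting its input across several tasks.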
Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He’s an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.
Yucai Yu is a software architect at Intel, where he works on Apache Spark upstream development and IA optimization. Previously, he worked at IBM and Citi Bank with a focus on OS, virtualization, storage, and data warehouses.