Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Tutorials & Conference
Beijing, China

基于成本的Spark SQL优化器框架 (A cost-based optimizer framework for Spark SQL)

此演讲使用中文 (This will be presented in Chinese)

Ron Hu (Huawei Technologies), 王振华 (Huawei Technologies)
14:00–14:40 Friday, 2017-07-14
Spark及更多发展 (Spark & beyond)
Location: 紫金大厅B(Grand Hall B) 观众水平 (Level): 中级 (Intermediate)

必要预备知识 (Prerequisite Knowledge)

SQL, Apache Spark, 数据库。

您将学到什么 (What you'll learn)


描述 (Description)

在Spark SQL的Catalyst优化器中,许多基于规则的优化技术已经实现,但优化器本身仍然有很大的改进空间。例如,没有关于数据分布的详细列统计信息,因此难以精确地估计过滤(filter)、连接(join)等数据库操作符的输出大小和基数 (cardinality)。由于不准确的估计,它经常导致优化器产生次优的查询执行计划。

我们在Spark SQL引擎内添加了一个基于成本的优化器框架。在我们的框架中,我们使用analyze table SQL语句收集表和列统计信息,并将它们保存到metastore中。对于列的统计信息,我们收集不同值的数量(number of distinct value),空值的数量,最大/最小值,平均/最大列长度等等。此外,我们保存等高直方图以更好地描述列的数据分布,从而有效地处理数据倾斜。此外,使用列的不同值的数量和表的行数量,我们可以确定列的唯一性,尽管Spark SQL不支持主键。这有助于确定例如 join 联接操作和多列 group by 分组操作的输出大小。

我们已把这个基于成本的优化器框架贡献给社区版本Spark 2.2。在我们的框架中,我们计算每个数据库操作符的基数和输出大小。通过可靠的统计和精确的估算,我们能够在这些领域做出好的决定:选择散列连接(hash join)操作的正确构建端(build side),选择正确的连接算法(如broadcast hash join与 shuffled hash join), 调整连接的顺序等等。在这次演讲中,我们将展示Spark SQL的新的基于成本的优化器框架及其对TPC-DS查询的性能影响。

Many rule-based optimizations have been implemented in the Spark SQL Catalyst optimizer, but the optimizer itself still has a lot of room for improvement. For example, there are no detailed column statistics about the data distribution, making it difficult to accurately estimate the output size and cardinality of database operators such as filter, join, and so on. Due to inaccurate estimates, it often results in the optimizer generating a suboptimal query execution plan.

Ron Hu and Zhenhua Wang explain how Huawei added a cost-based optimizer framework to the Spark SQL engine, using the analyze table SQL statement to collect table and column statistics and save them to metastore. For column statistics, Huawei collects the number of distinct values, the number of null values, the max/min value, the average and maximum column lengths, and so on and saves contour histograms to better describe the data distribution of the columns, effectively handling data skewness. Although Spark SQL does not support primary keys, using the number of different values of the columns and the number of rows in the table, Huawei can determine the uniqueness of columns. This helps to determine, for example, the output size of the join operation and the multicolumn group by operation.

The framework calculates the cardinality and output size of each database operator. With reliable statistics and accurate estimates, you can select the build side of the hash join operation, select the correct join algorithm, such as broadcast hash join and shuffled hash join, adjust the sequence of joins, and so on.

Huawei has contributed this cost-based optimizer framework to the community version of Spark 2.2. Ron and Zhenhua demonstrate the optimizer framework and discuss its impact on the performance of TPC-DS queries.

Photo of Ron Hu

Ron Hu

Huawei Technologies

Ron-Chung Hu is a database system architect at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. Previously, he worked at Teradata, Sybase, and MarkLogic, focusing on parallel database systems and search engines. Ron holds a PhD in computer science from the University of California, Los Angeles.

Photo of 王振华


Huawei Technologies

王振华 (Zhenhua Wang) is a research engineer at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. He holds a PhD in computer science from Zhejiang University. His research interests include information retrieval and web data mining.

Connect with O'ReillyData

Use the QR Code to follow OReillyData and get the latest conference information and browse data articles.

WeChat QRcode


Stay Connected Image 1
Stay Connected Image 3
Stay Connected Image 2

Read the latest ideas on big data.

ORB Data Site