Mar 15–18, 2020

Accelerating Spark-SQL with AVX-supported vectorization

Chendi Xue (Intel), Jian Zhang (Intel)
4:15pm–4:55pm Wednesday, March 18, 2020
Location: LL21A

Who is this presentation for?

Data engineers, data architects, developers




Spark SQL brings native SQL support to Spark and is widely adopted. Today, most Spark SQL operators process data row by row, and Spark relies on WholeStageCodegen to improve performance: it generates a Java function at runtime that eliminates virtual function calls and keeps intermediate data in CPU registers. With vectorization support, operators instead process multiple rows in one function call, and with the help of single instruction, multiple data (SIMD) instructions, one batch-processing function can process more data in fewer CPU cycles. In other words, by leveraging the SIMD instructions in modern processors, vectorization lets Spark SQL use fewer CPU cycles and deliver higher throughput.
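The difference between the two execution models can be sketched in a few lines. This is only an illustration, not Spark code: the query, the column names (`price`, `qty`), and both function names are hypothetical, and Python does not emit SIMD instructions. The point is the shape of the data and the call pattern: the columnar version receives whole columns as contiguous typed arrays in a single call, which is the layout a native engine can vectorize.

```python
from array import array

# Hypothetical query: SELECT price * qty FROM sales WHERE qty > 2

def eval_row_based(rows):
    """Row-at-a-time: each iteration reads the fields out of one row object."""
    out = []
    for row in rows:
        if row["qty"] > 2:
            out.append(row["price"] * row["qty"])
    return out

def eval_columnar(price, qty):
    """Batch-at-a-time: one call receives entire columns as contiguous typed arrays."""
    out = array("d")
    for i in range(len(qty)):
        if qty[i] > 2:
            out.append(price[i] * qty[i])
    return out

rows = [
    {"price": 1.5, "qty": 4},
    {"price": 2.0, "qty": 1},
    {"price": 3.0, "qty": 3},
]

# Pivot the same data into one typed array per column.
price = array("d", (r["price"] for r in rows))
qty = array("q", (r["qty"] for r in rows))

row_result = eval_row_based(rows)
col_result = list(eval_columnar(price, qty))
```

Both functions return the same values; only the data layout and the number of function calls per batch differ.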

Chendi Xue and Jian Zhang explore how Intel used this technology to accelerate Spark SQL, enabling a series of Spark SQL operators with columnar processing support, including Spark SQL input and output operators. You’ll learn how Intel used Apache Arrow to hold ColumnarBatch data in native memory and manage its memory references inside Spark, as well as how Intel leveraged the Apache Arrow Gandiva project for just-in-time (JIT) compilation of an ExpressionTree and evaluation of ColumnarBatch data. They explain how Intel optimized and extended Gandiva to support more operators such as Spark partition extraction, ColumnarBatch data join, aggregate, and sort. This supports vectorization for more SQL queries and maximizes performance with the Intel AVX-512 SIMD instruction set.
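The "compile the expression tree once, then evaluate it over a whole batch" idea can be sketched as follows. This is a toy stand-in, not Gandiva's API: Gandiva compiles expressions to native code, whereas this sketch turns the tree into nested Python closures. All names here (`Field`, `Literal`, `BinaryOp`, `compile_expr`, `batch_len`) are hypothetical, and a batch is assumed to be a non-empty dict mapping column names to equal-length lists.

```python
from dataclasses import dataclass
import operator

@dataclass(frozen=True)
class Field:
    name: str          # reference to a column in the batch

@dataclass(frozen=True)
class Literal:
    value: float       # constant broadcast to every row

@dataclass(frozen=True)
class BinaryOp:
    op: str
    left: object
    right: object

_OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}

def batch_len(batch):
    """Number of rows in a batch (assumes at least one column)."""
    return len(next(iter(batch.values())))

def compile_expr(node):
    """Walk the tree once and return a function: batch -> result column."""
    if isinstance(node, Field):
        name = node.name
        return lambda batch: batch[name]
    if isinstance(node, Literal):
        value = node.value
        return lambda batch: [value] * batch_len(batch)
    if isinstance(node, BinaryOp):
        fn = _OPS[node.op]
        left = compile_expr(node.left)
        right = compile_expr(node.right)
        return lambda batch: [fn(a, b) for a, b in zip(left(batch), right(batch))]
    raise TypeError(f"unsupported node: {node!r}")

# price * (qty + 1), compiled once and reusable for every incoming batch
expr = BinaryOp("*", Field("price"), BinaryOp("+", Field("qty"), Literal(1.0)))
evaluate = compile_expr(expr)

batch = {"price": [1.5, 2.0], "qty": [4.0, 1.0]}
result = evaluate(batch)
```

The tree is inspected only in `compile_expr`; `evaluate` can then be applied to any number of batches without re-walking it, which is the property a JIT-compiled evaluator exploits.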

Chendi and Jian detail their performance evaluation and analysis, using TPC-DS queries to compare the implementation against current Spark SQL in WholeStageCodegen-enabled mode. The performance analysis covers system metrics at all levels and a latency breakdown.

Prerequisite knowledge

  • A basic understanding of Spark SQL

What you'll learn

  • Learn about the current Spark SQL row-based implementation, Spark SQL ColumnarBased operator integration, and compiling a columnar ExpressionTree with AVX

Chendi Xue


Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years’ experience in big data and cloud system optimization, focusing on storage, network software stack performance analysis, and optimization. Her development work includes Spark-Shuffle optimization, Spark-SQL ColumnarBased execution, a compute-side cache implementation, and a storage benchmark tool. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master’s degree studies.


Jian Zhang


Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop distributed file system (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.
