Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Vectorized query processing using Apache Arrow

Siddharth Teotia (Dremio)
1:50pm–2:30pm Wednesday, March 7, 2018
Average rating: *****
(5.00, 1 rating)

Who is this presentation for?

  • Those working in analytics, databases, high-performance computing, hardware, and big data processing

What you'll learn

  • Learn how to do vectorized query processing in Dremio using Apache Arrow
  • Explore the significant ongoing development in columnar database technology and how high-performance systems are being designed

Description

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize the CPU and leverage changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.

Columnar data has become the de facto format for building high-performance query engines that run analytical workloads. Apache Arrow is an in-memory columnar data format that houses canonical in-memory representations for both flat and nested data structures. It is a natural complement to on-disk formats like Apache Parquet and Apache ORC. Dremio’s query processing engine leverages the columnar format of Apache Arrow and Parquet for in-memory and on-disk representations respectively.
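To make the in-memory representation concrete, here is a minimal sketch of how Arrow lays out a flat, nullable column: a validity bitmap alongside a contiguous values buffer. This toy model in plain Python only mirrors the idea; the names `make_column`, `validity`, and `buffer` are illustrative, not Arrow API.

```python
# Hedged sketch: a flat, nullable int column in an Arrow-like layout.
# Arrow represents such a column as a validity bitmap plus a contiguous
# values buffer; this toy model mirrors that split in plain Python.

def make_column(values):
    """Split a list containing Nones into (validity_bitmap, values_buffer)."""
    validity = [1 if v is not None else 0 for v in values]
    # Null slots still occupy a fixed-width slot in the buffer (here, 0),
    # which keeps the values buffer contiguous and randomly addressable.
    buffer = [v if v is not None else 0 for v in values]
    return validity, buffer

validity, buffer = make_column([3, None, 7, 12])
# validity -> [1, 0, 1, 1]
# buffer   -> [3, 0, 7, 12]
```

Keeping values in one contiguous buffer, with nulls tracked separately, is what lets downstream operators scan the column without per-value branching on null checks.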

Data stored in a columnar format is amenable to processing using vectorized instructions (SIMD) available on all modern architectures. Query processing algorithms can be implemented as simple, efficient code that operates on the columnar values in a tight loop, providing fast and CPU cache-friendly access patterns. Operations like SUM, FILTER, COUNT, MIN, and MAX on columnar data can be made more efficient by leveraging the data-level parallelism property of SIMD instructions.
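The tight-loop pattern described above can be sketched as follows. This is plain Python for readability, not engine code: real engines write loops like these in a compiled language so the compiler (or explicit intrinsics) can turn them into SIMD instructions. What the sketch shows is the branch-light, contiguous access pattern that columnar data makes possible.

```python
# Hedged sketch: columnar SUM and FILTER as tight loops over a value vector.

def column_sum(values):
    """SUM over a column: a single contiguous scan."""
    total = 0
    for v in values:  # one cache line feeds many consecutive iterations
        total += v
    return total

def column_filter(values, threshold):
    """FILTER over a column: emit a selection vector of passing row indices."""
    return [i for i, v in enumerate(values) if v > threshold]

col = [5, 17, 3, 42, 9]
column_sum(col)          # 76
column_filter(col, 10)   # [1, 3]
```

Producing a selection vector of indices, rather than copying matching values, is a common vectorized-engine trick: later operators apply the selection lazily and the column itself is never rewritten.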

Columnar data can be encoded using lightweight algorithms like dictionary encoding, run-length encoding, bit packing, and delta encoding that are far more CPU efficient than general-purpose compression algorithms like LZO and ZLIB. Furthermore, vectorized query processing algorithms can be written to be aware of column-level encodings and, in some cases, can operate directly on the compressed column values. This saves CPU-memory bandwidth, since only the necessary column values need to be decompressed. The columnar format allows us to efficiently utilize the CPU and GPU cache by filling cache lines with related data (column values from an in-memory vector). With the increasing use of GPUs and FPGAs, efficient use of the smaller on-chip memory available in these architectures is especially important. In addition, Apache Arrow allows for zero-copy, shared access to buffers so that multiple processes can more efficiently operate on the same data. On the storage side, columnar representation of on-disk data makes a good case for efficient utilization of disk I/O bandwidth for analytical queries.
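A minimal sketch of the first of those encodings, dictionary encoding, and of a filter that operates directly on the encoded values. The helper `dictionary_encode` is illustrative, not a library API; the key point is that the predicate is resolved against the (small) dictionary once, after which the scan compares integers instead of strings and never decompresses the column.

```python
# Hedged sketch: dictionary-encode a string column, then FILTER on the codes.

def dictionary_encode(values):
    """Return (dictionary, codes): distinct values and per-row integer codes."""
    dictionary, codes = [], []
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

cities = ["SFO", "NYC", "SFO", "SFO", "NYC", "LAX"]
dictionary, codes = dictionary_encode(cities)
# dictionary -> ["SFO", "NYC", "LAX"]
# codes      -> [0, 1, 0, 0, 1, 2]

# FILTER city == "NYC": look up the code once, then scan small integers.
target = dictionary.index("NYC")
matches = [i for i, c in enumerate(codes) if c == target]
# matches -> [1, 4]
```

Besides being faster to compare, the integer codes are far smaller than the original strings, which is exactly the CPU-memory bandwidth saving the paragraph above describes.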


Siddharth Teotia

Dremio

Siddharth Teotia is a software engineer at Dremio and a PMC member of the Apache Arrow project. Previously, Siddharth was a member of the database kernel team at Oracle, where he worked on the storage, indexing, and in-memory columnar query processing layers of Oracle RDBMS. He holds an MS in software engineering from CMU and a BS in information systems from BITS Pilani, India. During his studies, Siddharth focused on distributed systems, databases, and software architecture.