Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Modern query processing with columnar formats: The best is yet to come

Henry Robinson (Cloudera), Zuo Wang (Wanda), Arthur Peng (Intel)
11:20am–12:00pm Thursday, 10/01/2015
Hadoop Internals & Development
Location: 1 E16 / 1 E17 Level: Intermediate
Average rating: 3.71 (7 ratings)

Apache Parquet is an open source file format that arranges all of its data into columns – this is distinct from the traditional row-oriented layout, which stores entire rows consecutively. Columnar data offers many advantages to modern data engines – like Impala, Apache Spark, and Apache Flink – in terms of I/O efficiency, but the full benefits of the format have yet to be realized.
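As a toy illustration of the row-versus-column distinction (not Parquet's actual on-disk encoding), consider the same three records stored both ways:

```python
# Illustrative only: the same three records, row-oriented vs. column-oriented.
rows = [
    ("a", 1, 10.0),
    ("b", 2, 20.0),
    ("c", 3, 30.0),
]

# Row-oriented storage keeps whole records together; a query touching
# one column still reads every record. Column-oriented storage keeps each
# column's values together, so a scan reads only the columns it needs.
columns = list(zip(*rows))  # [('a','b','c'), (1,2,3), (10.0,20.0,30.0)]

ints = columns[1]  # scan just the integer column
print(sum(ints))   # 6
```

Because each column holds values of a single type stored contiguously, it also compresses better and is a natural fit for batch processing.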

We have been working with Intel to apply modern CPU instruction sets to the common programming tasks associated with querying data in Parquet format: decompression, predicate evaluation, and row reconstruction. Our work has yielded significant speedups in standard query benchmarks running on Cloudera’s Impala SQL query engine, and very high speedups in targeted microbenchmarks.
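One of those tasks, decoding, can be sketched with a toy run-length decoder (Parquet's real RLE/bit-packed hybrid encoding is considerably more involved, and the fast paths use CPU intrinsics rather than a per-run loop):

```python
# Toy run-length decoder: expand (value, count) runs into a flat column.
# Parquet's actual RLE/bit-packed hybrid is more complex than this sketch.
def rle_decode(runs):
    """Expand a list of (value, count) runs into a list of values."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

print(rle_decode([(7, 3), (0, 2)]))  # [7, 7, 7, 0, 0]
```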

In this talk we’ll describe the symbiosis between modern CPU architectures and the requirements of columnar data processing. We’ll show how vectorization – processing many items with a single instruction – is a widely applicable technique that can provide real performance benefits to all application frameworks that use columnar formats. We’ll present the changes that we have made to Impala’s ‘scanner,’ which reads Parquet data, and map out even more future enhancements.
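The shape of batch-at-a-time predicate evaluation over a column can be sketched as follows; a real engine would evaluate many values per SIMD instruction, whereas the Python loop here only illustrates the data flow (the column in, a selection bitmap out, then row reconstruction for the selected rows):

```python
# Sketch of predicate evaluation over a column producing a selection bitmap.
# Real engines replace this loop with SIMD instructions that compare
# many values at once; the logical result is the same.
def eval_predicate(column, threshold):
    """Return a selection bitmap: 1 where column[i] > threshold."""
    return [1 if v > threshold else 0 for v in column]

col = [5, 42, 17, 3, 99]
sel = eval_predicate(col, 10)
print(sel)                                 # [0, 1, 1, 0, 1]
# Row reconstruction materializes only the selected rows:
print([v for v, s in zip(col, sel) if s])  # [42, 17, 99]
```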

This talk will interest anyone curious about the internals of big data processing engines, or about the impact of recent advances in modern CPU architectures.


Henry Robinson

Cloudera

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.


Zuo Wang

Wanda

Zuo Wang is a principal researcher at Wanda AI Technology Center. For the past few years, he has worked on large-scale distributed deep learning systems including PaddlePaddle, MXNet, and TensorFlow, and led the effort to apply deep learning to clothes classification, clothing fashion analysis, and cross-domain clothing similarity matching. Zuo’s main interests are deep learning, computer vision, and distributed systems. He previously worked on MicroStrategy, a high-performance enterprise analytics platform, and Apache Impala, an SQL query engine for data stored in Apache Hadoop.

Arthur Peng

Intel

Arthur Peng is a software engineer at Intel, where he works on applications of Intel’s CPU technology to Impala.


Comments

Zuo Wang
09/28/2015 3:03pm EDT

Can you check whether your cluster machines support AVX2?
Show me the output of “cat /proc/cpuinfo | grep avx2”.
It should be at least a Haswell processor.

I am traveling; sorry for the late reply.

Tom Palmer
09/26/2015 6:49am EDT

We use Cloudera EDH and are on version 5.4.3.

Our cluster uses Intel chips (E5). Just bought some of the nodes 2 months ago.

Will these chips be able to use this technology, or do we need to buy a different chipset?

Looking forward to your presentation.