Apache Parquet is an open source file format that arranges all of its data into columns, in contrast to the traditional row-oriented layout, which stores entire rows consecutively. Columnar data offers substantial I/O-efficiency advantages to modern data engines – like Impala, Apache Spark, and Apache Flink – but the full benefits of the format have yet to be realized.
We have been working with Intel to apply modern CPU instruction sets to the common programming tasks associated with querying data in Parquet format: decompression, predicate evaluation, and row reconstruction. Our work has yielded significant speedups in standard query benchmarks running on Cloudera’s Impala SQL query engine, and even larger speedups in targeted microbenchmarks.
In this talk we’ll describe the symbiosis between modern CPU architectures and the requirements of columnar data processing. We’ll show how vectorization – processing many items with a single instruction – is a widely applicable technique that can deliver real performance benefits to any application framework that uses columnar formats. We’ll present the changes we have made to Impala’s ‘scanner,’ which reads Parquet data, and map out further enhancements.
This talk will appeal to anyone curious about the internals of big data processing engines, or about the impact of recent advances in modern CPU architectures.
Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.
Zuo Wang is a principal researcher at Wanda AI Technology Center. For the past few years, he has worked on large-scale distributed deep learning systems including PaddlePaddle, MXNet, and TensorFlow, and led the effort to apply deep learning to clothes classification, clothing fashion analysis, and cross-domain clothing similarity matching. Zuo’s main interests are deep learning, computer vision, and distributed systems. He previously worked on MicroStrategy, a high-performance enterprise analytics platform, and Apache Impala, an SQL query engine for data stored in Apache Hadoop.
Arthur Peng is a software engineer at Intel, where he works on applications of Intel’s CPU technology to Impala.
©2015, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.