Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark

Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)
14:05–14:45 Wednesday, 1 May 2019
Average rating: 4.20 (10 ratings)

Who is this presentation for?

  • Data architects and SQL developers

Level

Intermediate

Prerequisite knowledge

  • Intermediate knowledge of Apache Impala or Apache Hive (how they execute queries, how they model partitions in HDFS, etc.)
  • Familiarity with the Apache Parquet format (how data is stored in a columnar way using pages and column chunks)

What you'll learn

  • Learn how the Apache Parquet file format implements row group statistics and page level indexes
  • Understand how these data structures help accelerate selective queries in Apache Impala
  • Learn how to use these new capabilities to complement and augment your partitioning schemes
  • Explore performance benchmarks that illustrate the benefits of column indexes

Description

Apache Parquet is the most commonly used open source format for analytical data. The Parquet project recently added column indexes to the format, which enable query engines like Apache Impala, Apache Hive, and Apache Spark to achieve better performance on selective queries. Because column statistics are stored at both the row group level and the page level, query engines can discard large parts of the input data during query evaluation without incurring the cost of reading and decoding them. These new capabilities allow schema designers to reduce I/O volume for selective queries without overpartitioning their tables.
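The pruning idea described above can be illustrated with a minimal Python sketch. It is not the Parquet or Impala implementation: the `Page` and `RowGroup` classes, the `rows_to_scan` function, and the sample data are all hypothetical, and the sketch assumes a single integer column with min/max statistics and a simple range predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical page with min/max statistics for one integer column."""
    min_value: int
    max_value: int
    num_rows: int


@dataclass
class RowGroup:
    """Hypothetical row group; its statistics are derived from its pages."""
    pages: List[Page]

    @property
    def min_value(self) -> int:
        return min(p.min_value for p in self.pages)

    @property
    def max_value(self) -> int:
        return max(p.max_value for p in self.pages)


def overlaps(min_value: int, max_value: int, lo: int, hi: int) -> bool:
    """True if [min_value, max_value] can contain a value in [lo, hi]."""
    return max_value >= lo and min_value <= hi


def rows_to_scan(row_groups: List[RowGroup], lo: int, hi: int) -> int:
    """Count rows that must be read for the predicate lo <= x <= hi."""
    rows = 0
    for rg in row_groups:
        # Row group statistics: skip the whole row group if it cannot match.
        if not overlaps(rg.min_value, rg.max_value, lo, hi):
            continue
        for page in rg.pages:
            # Page-level statistics: skip individual pages that cannot match.
            if overlaps(page.min_value, page.max_value, lo, hi):
                rows += page.num_rows
    return rows


def make_row_group(start: int, pages: int = 4, rows_per_page: int = 100) -> RowGroup:
    """Build a row group over sorted, contiguous values (an assumption)."""
    return RowGroup([
        Page(start + i * rows_per_page,
             start + (i + 1) * rows_per_page - 1,
             rows_per_page)
        for i in range(pages)
    ])


row_groups = [make_row_group(0), make_row_group(400), make_row_group(800)]
total_rows = sum(p.num_rows for rg in row_groups for p in rg.pages)
scanned = rows_to_scan(row_groups, 450, 520)
print(f"scanned {scanned} of {total_rows} rows")
```

In this toy data set the values are sorted, so the min/max ranges of pages do not overlap and a narrow range predicate touches only a few pages. With unsorted data the ranges would overlap more and fewer pages could be skipped.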

Anna Szonyi and Zoltán Borók-Nagy explore row group statistics and page level indexes in Parquet files in detail. You’ll learn how query engines write these data structures and how they are used during query execution to achieve higher performance for selective queries. You’ll also learn how to leverage these new capabilities in your schema design and how to observe their impact on query performance.

Anna and Zoltán conclude by describing how the Parquet and Impala communities worked together to extend the file format and validate the approach using Impala as a platform for prototypical research. They also explain the challenges of adding the indexes to Parquet’s own API without breaking backward compatibility and how the project ensured that end users of the Parquet Java library would get this functionality in a completely transparent manner.

Anna Szonyi

Cloudera

Anna Szonyi is an engineering manager at Cloudera, where she established and manages the data interoperability team. Anna cares about enabling people to build high-quality software in a sustainable environment. Previously, she was a software engineer at Cloudera working on Apache Sqoop and worked on risk management systems at Morgan Stanley.

Zoltán Borók-Nagy

Cloudera

Zoltán Borók-Nagy is a software engineer at Cloudera, where he works on Apache Impala and is a member of the project's PMC. Previously, Zoltán worked for Ericsson, developing software analysis tools that have since become open source.