Presented By
O’Reilly + Cloudera

29 April–2 May 2019
London, UK

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark

Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)

14:05–14:45 Wednesday, 1 May 2019

Data Engineering and Architecture
Location: S11 A

Average rating: 4.20 (10 ratings)

Who is this presentation for?

Data architects and SQL developers

Level

Intermediate

Prerequisite knowledge

Intermediate knowledge of Apache Impala or Apache Hive (how they execute queries, how they model partitions in HDFS, etc.)
Familiarity with the Apache Parquet format (how data is stored in a columnar way using pages and column chunks)

What you'll learn

Learn how the Apache Parquet file format implements row-group statistics and page-level indexes
Understand how these data structures help accelerate selective queries in Apache Impala
Learn how to use these new capabilities to complement and augment your partitioning schemes
Explore performance benchmarks that illustrate the benefits of column indexes

Description

Apache Parquet is the most commonly used open source format for analytical data. The Parquet project recently added column indexes to the format, which enable query engines like Apache Impala, Apache Hive, and Apache Spark to achieve better performance on selective queries. Because column statistics are stored at both the row-group level and the page level, large parts of the input data can be discarded during query evaluation without incurring the cost of reading and decoding them. These new capabilities allow schema designers to reduce I/O volume for selective queries without overpartitioning their tables.
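The skipping idea described above can be illustrated with a minimal sketch. This is not Parquet's actual on-disk layout or API: the `Page` class, `build_pages`, and `scan_equals` are hypothetical names invented for illustration, and the "pages" are plain Python lists that carry a minimum and maximum value each.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a data page with min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_pages(values: List[int], page_size: int) -> List[Page]:
    """Split a column into fixed-size pages and record min/max for each."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append(Page(chunk, min(chunk), max(chunk)))
    return pages


def scan_equals(pages: List[Page], target: int) -> Tuple[List[int], int]:
    """Return matching values and the number of pages actually read."""
    matches: List[int] = []
    pages_read = 0
    for page in pages:
        # The statistics alone prove the page cannot contain the target,
        # so it is skipped without touching its values.
        if target < page.min_value or target > page.max_value:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if v == target)
    return matches, pages_read


values = list(range(1000))            # a sorted column of 1000 integers
pages = build_pages(values, 100)      # 10 pages of 100 values each
matches, pages_read = scan_equals(pages, 421)
print(matches, pages_read, len(pages))  # [421] 1 10
```

In this toy setup, only one of the ten pages is read for the predicate `value = 421`; the other nine are ruled out from their statistics alone.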

Anna Szonyi and Zoltán Borók-Nagy explore row-group statistics and page-level indexes in Parquet files in detail. You’ll learn how query engines write these data structures and how they are used during query execution to achieve higher performance for selective queries. You’ll also learn how to leverage these new capabilities in your schema design and how to observe their impact on query performance.
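One way to get a feel for the impact of min/max statistics is to count how many pages a range predicate would have to read under different data layouts. The sketch below rests on an assumption the abstract does not state: that the ordering of the data within a file affects how selective the statistics are. The helpers `page_ranges` and `pages_to_read` are hypothetical, and the data is synthetic.

```python
import random


def page_ranges(values, page_size):
    """Return a (min, max) pair for each fixed-size page of the column."""
    ranges = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def pages_to_read(ranges, low, high):
    """Count pages whose [min, max] interval overlaps the range [low, high]."""
    return sum(1 for mn, mx in ranges if mx >= low and mn <= high)


rng = random.Random(0)                 # fixed seed for repeatability
sorted_values = list(range(10_000))
shuffled_values = sorted_values[:]
rng.shuffle(shuffled_values)

# Same values, same predicate (4200 <= value <= 4300), different layout.
sorted_read = pages_to_read(page_ranges(sorted_values, 500), 4200, 4300)
shuffled_read = pages_to_read(page_ranges(shuffled_values, 500), 4200, 4300)
print(sorted_read, shuffled_read)
```

In this synthetic case the sorted layout needs a single page, while the shuffled layout needs more, because nearly every shuffled page spans most of the value range.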

Anna and Zoltán conclude by describing how the Parquet and Impala communities worked together to extend the file format and validate the approach using Impala as a platform for prototypical research. They also explain the challenges in adding the indexes to Parquet’s own API without breaking backward compatibility and how the project ensured that end users of the Parquet Java library would get this functionality in a completely transparent manner.

Anna Szonyi

Cloudera

Anna Szonyi is an engineering manager at Cloudera, where she established and manages the data interoperability team. Anna cares about enabling people to build high-quality software in a sustainable environment. Previously, she was a software engineer at Cloudera working on Apache Sqoop and worked on risk management systems at Morgan Stanley.

Zoltán Borók-Nagy

Cloudera

Zoltán Borók-Nagy is a software engineer at Cloudera, where he works on Apache Impala and is a member of the project's PMC. Previously, Zoltán worked for Ericsson, developing software analysis tools that have since become open source.

©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com