Apache Parquet is the most commonly used open source format for analytical data. The Parquet project recently added column indexes to the format, which enable query engines like Apache Impala, Apache Hive, and Apache Spark to achieve better performance on selective queries. By storing column statistics at both the row group level and the page level, the format lets engines discard large parts of the input data during query evaluation without incurring the cost of reading and decoding them. These new capabilities allow schema designers to reduce I/O volume for selective queries without overpartitioning their tables.
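To make the idea of statistics-based skipping concrete, here is a minimal sketch in plain Python. It is not the Parquet library's API or on-disk layout: the `Page` class and the `build_pages` and `scan_equal` helpers are hypothetical names invented for illustration, and a page is modeled simply as a list of integers with a stored minimum and maximum.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a data page plus its min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_pages(values: List[int], page_size: int) -> List[Page]:
    """Split a column into fixed-size pages and record min/max for each."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append(Page(chunk, min(chunk), max(chunk)))
    return pages


def scan_equal(pages: List[Page], target: int) -> Tuple[List[int], int]:
    """Evaluate `column = target`, skipping pages whose range excludes it.

    Returns the matching values and the number of pages actually read.
    """
    matches: List[int] = []
    pages_read = 0
    for page in pages:
        # The statistics alone prove that this page cannot contain a match,
        # so its values are never touched (no read, no decode).
        if target < page.min_value or target > page.max_value:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if v == target)
    return matches, pages_read


# Assumed example data: a sorted column of 1,000 values in pages of 100.
sorted_column = list(range(1000))
pages = build_pages(sorted_column, page_size=100)
matches, pages_read = scan_equal(pages, 250)
```

In this sketch the column is sorted, so the page ranges do not overlap and a selective predicate touches one page out of ten. With unsorted data the min/max ranges tend to overlap and far fewer pages can be skipped, which is why data layout matters for this kind of pruning.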
Anna Szonyi and Zoltán Borók-Nagy explore row group statistics and page-level indexes in Parquet files in detail. You’ll learn how query engines write these data structures and how they use them during query execution to achieve higher performance for selective queries. You’ll also learn how to leverage these new capabilities in your schema design and how to observe their impact on query performance.
Anna and Zoltán conclude by describing how the Parquet and Impala communities worked together to extend the file format and validate the approach, using Impala as a platform for prototypical research. They also explain the challenges of adding the indexes to Parquet’s own API without breaking backward compatibility and how the project ensured that end users of the Parquet Java library would get this functionality in a completely transparent manner.
Anna Szonyi is an engineering manager at Cloudera, where she established and manages the data interoperability team. Anna cares about enabling people to build high-quality software in a sustainable environment. Previously, she was a software engineer at Cloudera working on Apache Sqoop and worked on risk management systems at Morgan Stanley.
Zoltán Borók-Nagy is a software engineer at Cloudera, where he works on Apache Impala and is a member of the project's PMC. Previously, Zoltán worked for Ericsson, developing software analysis tools that have since become open source.