Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Parquet performance tuning: The missing guide

Ryan Blue (Netflix)
11:20am–12:00pm Thursday, 09/29/2016
Data innovations
Location: 1 E 07/1 E 08
Level: Intermediate
Average rating: 4.71 (7 ratings)

Prerequisite knowledge

  • A very basic understanding of Parquet

What you'll learn

  • Learn the basics of writing Parquet data to get great performance

Description

    Increasing demand for more and higher-granularity data continues to push the boundaries of what is possible to process using big data technologies. Netflix’s Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 40 petabytes of data. At this scale, the team is reaching the limits of partitioning, with thousands of tables and millions of partitions per table.

    To work around the diminishing returns of additional partition layers, the team increasingly relies on the Parquet file format and recently made additions to Presto that yielded a performance improvement of more than 100x for some real-world queries over Parquet data. The team is currently adding similar functionality to other processing engines like Spark, Hive, and Pig. Data written in Parquet is not optimized by default for these newer features, so the team is tuning how it writes Parquet to maximize the benefit.

    Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.

    Topics include:

    • The tools and techniques Netflix uses to analyze Parquet tables
    • How to spot common problems
    • Recommendations for Parquet configuration settings to get the best performance out of your processing platform
    • The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform

    Ryan Blue

    Netflix

    Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.


    Comments

    Ryan Blue
    09/30/2016 1:34pm EDT

    Looks like uploading them to the O’Reilly page didn’t link them here, so I uploaded the slides to SlideShare: http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide

    Igor Vasilchikov
    09/29/2016 2:54pm EDT

    Any chance of getting the slides from the talk? Unfortunately I missed it!

    09/29/2016 10:05am EDT

    Could you please post the link to your presentation here? Thanks!