Increasing demand for more and higher-granularity data continues to push the boundaries of what is possible to process using big data technologies. Netflix’s Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 40 petabytes of data. At this scale, we are reaching the limits of partitioning, with thousands of tables and millions of partitions per table.
To work around the diminishing returns of additional partition layers, the team increasingly relies on the Parquet file format and recently made additions to Presto that resulted in an over 100x performance improvement for some real-world queries over Parquet data. The team is currently adding similar functionality to other processing engines like Spark, Hive, and Pig. Data written in Parquet is not optimized by default for these newer features, so the team is tuning how they write Parquet to maximize the benefit.
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.
Comments on this page are now closed.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.