Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Netflix: Making big data small

Daniel Weeks (Netflix)
11:00am–11:40am Thursday, 03/31/2016
Data Innovations

Location: LL21 E/F
Tags: media
Average rating: 4.56 (27 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of the Hadoop ecosystem and some background in distributed processing.

Description

Growing demand for more data at higher granularity continues to push the boundaries of what is possible to process using big data technologies. At Netflix, the Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 25 petabytes of data. Even with great effort to optimize datasets, our metadata is growing to the point that it is itself big data. With thousands of tables, and partitions per table reaching into the millions, existing approaches to isolating data and processing it efficiently are breaking down.

The Big Data Platform team recently made additions to Presto and is adding similar functionality to other processing engines like Spark, Hive, and Pig that will allow for new paradigms of efficient storage and optimized processing. The foundation for this effort is the Parquet file format, which is the predominant storage format in our data warehouse. Integrating advanced features of this format with capabilities of modern processing engines allows us to enhance processing by use of the following:

  • Vectorized read path for Parquet
  • Predicate pushdown with column statistics
  • Predicate pushdown with dictionary encoding
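
As a rough illustration of the two pushdown techniques listed above, the following is a minimal sketch in plain Python, not Netflix's or Parquet's actual implementation. The `RowGroup` class, the helper functions, and the sample country codes are all hypothetical; real Parquet files store min/max statistics and dictionary pages in file metadata rather than computing them on the fly.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group of a single string column."""
    values: List[str]

    # In a real Parquet file these would be read from metadata written at
    # write time; they are computed here only to keep the sketch short.
    @property
    def min_value(self) -> str:
        return min(self.values)

    @property
    def max_value(self) -> str:
        return max(self.values)

    @property
    def dictionary(self) -> Set[str]:
        return set(self.values)


def may_contain_by_stats(group: RowGroup, target: str) -> bool:
    """Column statistics: the target can only be present inside [min, max]."""
    return group.min_value <= target <= group.max_value


def may_contain_by_dictionary(group: RowGroup, target: str) -> bool:
    """Dictionary encoding: the target must appear among the distinct values."""
    return target in group.dictionary


def scan_equals(groups: List[RowGroup], target: str) -> Tuple[List[str], int]:
    """Evaluate `column = target`, skipping row groups that cannot match.

    Returns the matching values and the number of row groups actually read.
    """
    matches: List[str] = []
    groups_read = 0
    for group in groups:
        if not may_contain_by_stats(group, target):
            continue  # pruned by min/max statistics
        if not may_contain_by_dictionary(group, target):
            continue  # pruned by the dictionary
        groups_read += 1
        matches.extend(v for v in group.values if v == target)
    return matches, groups_read


# Hypothetical sample data: three row groups of country codes.
groups = [
    RowGroup(["AR", "BR", "CA"]),
    RowGroup(["DE", "FR", "US"]),
    RowGroup(["JP", "KR", "US"]),
]
```

With this sample data, `scan_equals(groups, "GB")` reads no row groups at all: the first and third are excluded by their min/max range, and the second passes the range check but is excluded because "GB" is absent from its dictionary. Statistics prune best when data is sorted or clustered on the filtered column, which is presumably where the storage patterns mentioned in the session come in.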

Daniel Weeks takes attendees through how Netflix combines these features with storage patterns and other enhancements the team has made to achieve performance improvements, which can exceed 100x in real-world query cases. Daniel will go into detail about how Netflix’s big data platform collects, processes, and stores data from the data pipeline to take advantage of Parquet and how the team uses Spark on YARN for ETL and Presto for interactive analytics to enhance performance. He will also demonstrate applications of this approach that dramatically speed up the querying of telemetry-service data and A/B-testing-platform data.

Daniel Weeks

Netflix

Daniel Weeks manages the big data compute team at Netflix and is a Parquet committer. Previously, Daniel focused on research in big data solutions and distributed systems.

Comments on this page are now closed.

Comments

Daniel Weeks
04/04/2016 8:51am PDT

Deepak, there should be a video link coming. It usually takes a little while to edit and post the videos. I believe they will send an email when the videos are available.

deepak agarwal
04/04/2016 7:49am PDT

Is there a video link available for this?

Prakash Killada
04/01/2016 6:52am PDT

Can you please share the presentation?