Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

The evolution of Netflix's S3 data warehouse

Ryan Blue (Netflix), Daniel Weeks (Netflix)
1:15pm–1:55pm Wednesday, 09/12/2018
Big data and data science in the cloud
Location: 1A 10
Level: Intermediate
Secondary topics: Data Platforms

Who is this presentation for?

  • Software and data engineers

Prerequisite knowledge

  • A high-level understanding of S3 and how it is used in Hadoop (useful but not required)

What you'll learn

  • How Netflix's use of S3 for a data warehouse has changed over time, which tools the company currently uses and which it has retired, its current recommendations, and what it is working on next

Description

In the last few years, Netflix’s S3 data warehouse has grown to more than 100 PB. In that time, the company has shared several techniques and released open source tools for working around S3’s quirks, including s3mper to work around eventual consistency, S3 multipart committers to commit data without renames, and the batchid pattern for cross-partition atomic commits.

Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3 that is replacing many of the company’s current tools. Iceberg enables a new generation of improvements, including:

  • Snapshot isolation with no directory listing or file renames
  • Distributed planning to relieve metastore bottlenecks
  • Improved data layout for S3 performance
  • Immediately available writes from streaming applications
  • Opportunistic compaction and data optimization
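
As a rough illustration of the first bullet, the toy model below sketches one way snapshot isolation can work without listing directories or renaming files: each snapshot is an immutable list of data files, and a commit atomically swaps the table's pointer to a new snapshot. This is not Iceberg's actual implementation or API; `ToyTable`, `Snapshot`, `commit_append`, and the S3 path are hypothetical names chosen for the example, and a real table would keep this metadata in durable storage rather than in memory.

```python
import threading
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: an id plus the exact files it contains."""
    snapshot_id: int
    files: Tuple[str, ...]


class ToyTable:
    """Hypothetical in-memory table whose state is a pointer to one snapshot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = Snapshot(0, ())

    def current(self) -> Snapshot:
        # Readers pin a snapshot; later commits never change what it lists.
        return self._current

    def commit_append(self, base: Snapshot, new_files: Iterable[str]) -> Snapshot:
        # Build the new snapshot from the base the writer started with.
        candidate = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        with self._lock:
            # Swap the pointer only if no other writer committed in between.
            if self._current is not base:
                raise RuntimeError("concurrent commit; retry on the latest snapshot")
            self._current = candidate
        return candidate


table = ToyTable()
reader_view = table.current()  # a reader pins the empty snapshot
table.commit_append(reader_view, ["s3://bucket/data/a.parquet"])  # hypothetical path

# The pinned reader still sees 0 files; a new reader sees 1.
print(len(reader_view.files), len(table.current().files))
```

Because readers get the file list from the snapshot itself, they never need to list a directory, and because a commit is a single pointer swap, writers never need to rename files into place.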

Ryan Blue

Netflix

Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.

Daniel Weeks

Netflix

Daniel Weeks manages the big data compute team at Netflix and is a Parquet committer. Previously, Daniel focused on research in big data solutions and distributed systems.