Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Introducing Iceberg: Tables designed for object stores

Owen O'Malley (Cloudera), Ryan Blue (Netflix)
5:25pm–6:05pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1E 09 Level: Intermediate
Average rating: 4.33 (3 ratings)

Who is this presentation for?

  • Big data software engineers and data scientists

Prerequisite knowledge

  • Familiarity with big data processing in Hive, Spark, or Presto

What you'll learn

  • Explore Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores
  • Understand the motivation, design, and performance metrics of the new tables, which can be 10–100x faster for query planning

Description

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3, which, like other object stores, doesn’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.

Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new Apache-licensed open source project that defines a table layout with properties specifically designed for cloud object stores such as S3 and that addresses the challenges of current Hive tables. It specifies a portable table format and standardizes many important features, including:

  • All reads use snapshot isolation without locking.
  • No directory listings are required for query planning.
  • Files can be added, removed, or replaced atomically.
  • Full schema evolution supports changes in the table over time.
  • Partitioning evolution enables changes to the physical layout without breaking existing queries.
  • Data files are stored as Avro, ORC, or Parquet.
  • Support for Spark, Hive, and Presto.
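
The first three properties in the list above (lock-free snapshot reads, planning without directory listings, and atomic file changes) can be illustrated with a toy model. This is a minimal sketch, not Iceberg's API or file format: `ToyTable`, `Snapshot`, and `plan_scan` are hypothetical names invented for this example, and the table metadata lives in memory rather than in an object store. The only idea it borrows from the description is that a table version is an explicit, immutable list of files, and that a commit replaces the whole list in one step.

```python
import threading
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of the data files in one table version (hypothetical)."""
    snapshot_id: int
    files: FrozenSet[str]


class ToyTable:
    """Hypothetical in-memory stand-in for table metadata; not Iceberg's API."""

    def __init__(self) -> None:
        self._current = Snapshot(0, frozenset())
        # Serializes writers only; readers never take this lock.
        self._commit_lock = threading.Lock()

    def current_snapshot(self) -> Snapshot:
        # A reader pins whatever version is current and keeps using it,
        # even if later commits replace the table's current pointer.
        return self._current

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> Snapshot:
        # Build a complete new file list, then publish it with a single
        # reference swap, so adds, removes, and replaces appear all at once.
        with self._commit_lock:
            base = self._current
            to_remove = set(remove)
            missing = to_remove - base.files
            if missing:
                raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
            files = frozenset((base.files - to_remove) | set(add))
            self._current = Snapshot(base.snapshot_id + 1, files)
            return self._current


def plan_scan(snapshot: Snapshot) -> List[str]:
    # Planning reads the file list recorded in the snapshot;
    # nothing here lists a directory.
    return sorted(snapshot.files)


table = ToyTable()
table.commit(add=["a.parquet", "b.parquet"])
reader_view = table.current_snapshot()  # a reader starts a query here

# A writer replaces both files with a compacted one in a single commit.
table.commit(add=["c.parquet"], remove=["a.parquet", "b.parquet"])

print(plan_scan(reader_view))               # the reader's pinned version
print(plan_scan(table.current_snapshot()))  # the version new readers see
```

In this sketch the reader that pinned its snapshot before the second commit still plans against `a.parquet` and `b.parquet`, while new readers see only `c.parquet`. No reader ever observes a half-applied change.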

Owen O'Malley

Cloudera

Owen O’Malley is a cofounder and technical fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. He was previously the architect of MapReduce and Security and is now the architect of Hive. He’s driving the development of the ORC file format and adding ACID transactions to Hive.

Ryan Blue

Netflix

Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.