Presented By O’Reilly and Cloudera

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Introducing Iceberg: Tables designed for object stores

Owen O'Malley (Cloudera), Ryan Blue (Netflix)

5:25pm–6:05pm Wednesday, 09/12/2018

Data engineering and architecture
Location: 1E 09
Level: Intermediate

Who is this presentation for?

Big data software engineers and data scientists

Prerequisite knowledge

Familiarity with big data processing in Hive, Spark, or Presto

What you'll learn

Explore Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores
Understand the motivation, design, and performance metrics of the new tables, which can be 10–100x faster for query planning

Description

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.

Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new Apache-licensed open source project that defines a table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg specifies a portable table format and standardizes many important features, including:

All reads use snapshot isolation without locking.
No directory listings are required for query planning.
Files can be added, removed, or replaced atomically.
Full schema evolution supports changes in the table over time.
Partitioning evolution enables changes to the physical layout without breaking existing queries.
Data files are stored as Avro, ORC, or Parquet.
Spark, Hive, and Presto are supported.
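
The first three features in the list above can be illustrated with a small toy model. The sketch below is not Iceberg's API or file format: ToyTable, Snapshot, commit, and plan_scan are hypothetical names, and the in-memory lock merely stands in for whatever atomic operation a real implementation would use to swap its metadata pointer. It only shows the general idea that each table version is an immutable list of files, that readers pin one version, and that writers publish a new version in a single step.

```python
import threading
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """An immutable record of the data files that make up one table version."""
    snapshot_id: int
    files: FrozenSet[str]


class ToyTable:
    """Hypothetical in-memory stand-in for a table's current-version pointer."""

    def __init__(self) -> None:
        self._current = Snapshot(0, frozenset())
        self._lock = threading.Lock()  # assumption: models an atomic pointer swap

    def current_snapshot(self) -> Snapshot:
        # Readers grab a reference to an immutable snapshot and hold no lock
        # while they plan or scan.
        return self._current

    def commit(self, base: Snapshot,
               added: Iterable[str] = (),
               removed: Iterable[str] = ()) -> Snapshot:
        """Publish a new snapshot only if `base` is still the current one."""
        new = Snapshot(base.snapshot_id + 1,
                       (base.files - frozenset(removed)) | frozenset(added))
        with self._lock:
            if self._current is not base:
                raise RuntimeError("concurrent commit; retry from the new snapshot")
            self._current = new
        return new


def plan_scan(snapshot: Snapshot) -> List[str]:
    # Planning reads the snapshot's own file list; no directory is listed.
    return sorted(snapshot.files)


table = ToyTable()
s1 = table.commit(table.current_snapshot(), added=["a.parquet", "b.parquet"])

# A reader pins the version it started with.
reader_view = table.current_snapshot()

# A writer replaces both files with one compacted file in a single commit.
table.commit(s1, added=["c.parquet"], removed=["a.parquet", "b.parquet"])

print(plan_scan(reader_view))               # ['a.parquet', 'b.parquet']
print(plan_scan(table.current_snapshot()))  # ['c.parquet']
```

The reader that pinned the earlier snapshot still sees the two original files after the replacement commits, while new readers see only the compacted file. A commit based on a stale snapshot is rejected rather than silently overwriting the newer version.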

Owen O'Malley

Cloudera

Owen O’Malley is a cofounder and technical fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. He was previously the architect of MapReduce and Security and is now the architect of Hive. He’s driving the development of the ORC file format and adding ACID transactions to Hive.

Ryan Blue

Netflix

Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.
