
HDFS Snapshots and Beyond

Jing Zhao (Hortonworks, Inc.), Tsz-Wo Sze (Hortonworks Inc.)
Hadoop Platform Grand Ballroom East

Snapshots have been added to HDFS to improve enterprise readiness and support Business Continuity Planning. A snapshot provides a consistent point-in-time state of the file system. Important use cases include protecting data against accidental user errors, enabling experimental setups, and disaster recovery. In the first part of the talk, we will introduce how the snapshot feature works and the best practices for its usage and management.

In the second part of the talk, we will provide details about snapshot design, development, and testing. Snapshots are implemented efficiently as a Namenode-only feature with low memory overhead and instantaneous creation. They have no storage overhead, as no extra copies of data are required. The design also ensures no adverse impact on regular HDFS operations. The feature has undergone comprehensive testing, with 1.4 million generated system tests covering most combinations of operations that mimic real-world snapshot usage.
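To make the "metadata only, instantaneous creation, no data copies" idea concrete, here is a minimal Python sketch of a toy namespace that supports snapshots by recording diffs. It is an illustration under stated assumptions, not the HDFS implementation: the class `SnapshottableDir`, its methods, and the use of integer block IDs are all hypothetical names invented for this example.

```python
class SnapshottableDir:
    """Toy namespace: maps file name -> tuple of block IDs (metadata only).

    Hypothetical illustration; not the actual Namenode data structure.
    """

    _ABSENT = object()  # marks "file did not exist when the snapshot was taken"

    def __init__(self):
        self.current = {}    # live namespace
        self.snapshots = []  # list of (snapshot name, diff); diff: file -> old value

    def create_snapshot(self, name):
        # Constant time: nothing is copied, only an empty diff is appended.
        self.snapshots.append((name, {}))

    def _record(self, filename):
        # Save the pre-change value once, in the most recent snapshot's diff.
        if self.snapshots:
            diff = self.snapshots[-1][1]
            diff.setdefault(filename, self.current.get(filename, self._ABSENT))

    def write(self, filename, blocks):
        self._record(filename)
        self.current[filename] = tuple(blocks)

    def delete(self, filename):
        self._record(filename)
        self.current.pop(filename, None)

    def read_snapshot(self, name):
        # Rebuild the view by undoing diffs from the newest back to the requested one.
        view = dict(self.current)
        for snap_name, diff in reversed(self.snapshots):
            for filename, old in diff.items():
                if old is self._ABSENT:
                    view.pop(filename, None)
                else:
                    view[filename] = old
            if snap_name == name:
                return view
        raise KeyError(name)


d = SnapshottableDir()
d.write("a", [1, 2])
d.create_snapshot("s1")
d.write("a", [3])
d.write("b", [4])
d.create_snapshot("s2")
d.delete("a")
print(d.read_snapshot("s1"))
print(d.read_snapshot("s2"))
```

In this toy model the cost of a snapshot grows only with the number of files modified after it was taken, which is one plausible way to read the abstract's claim of low memory overhead.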

In the last part of the talk, we will discuss the new Hadoop use cases enabled by snapshots. We will first introduce how to improve Distcp and data mirroring using this feature. We will also explore how the rest of the Hadoop stack could use this feature, for example HBase snapshots based on HDFS snapshots and Hive table snapshots.
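One way snapshots could help mirroring is by comparing two snapshots and copying only what changed between them. The Python sketch below illustrates that idea on plain dictionaries; it is an assumption about the general approach, not a description of how Distcp works. The functions `snapshot_diff` and `sync_mirror`, and the representation of a snapshot as a mapping from path to content version, are hypothetical.

```python
def snapshot_diff(old, new):
    """Compare two snapshot views (dict: path -> content version)."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"created": created, "modified": modified, "deleted": deleted}


def sync_mirror(mirror, new, diff):
    """Bring a mirror that matches the old snapshot up to the new snapshot."""
    for path in diff["deleted"]:
        del mirror[path]
    for path in diff["created"] + diff["modified"]:
        mirror[path] = new[path]  # only changed paths are copied
    return mirror


s1 = {"/d/a": 1, "/d/b": 1, "/d/c": 1}
s2 = {"/d/a": 1, "/d/b": 2, "/d/e": 1}

diff = snapshot_diff(s1, s2)
mirror = sync_mirror(dict(s1), s2, diff)
print(diff)
print(mirror == s2)
```

Because both snapshots are immutable point-in-time views, the diff stays valid even if the source directory keeps changing while the copy runs.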

Jing Zhao

Hortonworks, Inc.

Jing Zhao is a software engineer at Hortonworks, where he currently works on HDFS. Before joining Hortonworks, he received his B.E. from Tsinghua University, China, and his Ph.D. from the University of Southern California, USA.

Tsz-Wo Sze

Hortonworks Inc.

Dr. Tsz-Wo Nicholas Sze is a Member of Technical Staff at Hortonworks and a member of the Apache Hadoop Project Management Committee. His interests include distributed computing, algorithms, and mathematical analysis. Two of his recent Hadoop contributions are HDFS Snapshots and WebHDFS. In 2010 he used Hadoop on Yahoo’s clusters to set a new world record for the computation of Pi. He received his Ph.D. in Computer Science from the University of Maryland, College Park in 2007, and his M.Phil. and B.Eng. degrees from the Hong Kong University of Science and Technology in 2001 and 1999, respectively.



Tsz-Wo Sze
10/30/2013 5:33pm EDT

I have just uploaded them. Are you able to download them?

Marek K Kolodziej
10/30/2013 4:32pm EDT

Would it be possible to post the slides here, like the other speakers have?

