Sep 23–26, 2019

Apache Hadoop 3.x state of the union and upgrade guidance

Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)
4:35pm–5:15pm Wednesday, September 25, 2019
Location: 1E 07/08

Who is this presentation for?

  • CIOs, infrastructure engineers, and SREs




Apache Hadoop YARN is the modern distributed operating system for big data applications. It turned the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. Many organizations build their applications on top of Hadoop with YARN so that they don't have to solve resource management, isolation, and multitenancy again for each one. The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Wangda Tan and Wei-Chiu Chuang outline the current status of Apache Hadoop 3.x and how it’s used today in deployments large and small. They then dive into the present and future of Hadoop 3.x: features that further strengthen Hadoop as the primary resource-management platform and storage system for enterprise data centers.

They explore the current status and future promise of features and initiatives for both YARN and HDFS in Hadoop 3.x. For YARN 3.x, these include powerful container placement; global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and field-programmable gate array (FPGA) scheduling and isolation; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without any changes; seamless application and service upgrades; scheduling features such as application priorities and intra-queue preemption across applications; and operational enhancements including insights through Timeline Service v2, a new web UI, and better queue management. On the storage side, HDFS 3.0 announced general availability (GA) of erasure coding, which doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS also added support for multiple standby NameNodes for better availability.
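The "doubles the storage efficiency" claim for erasure coding can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the comparison is between HDFS's default 3x replication and a Reed-Solomon layout with 6 data units and 3 parity units (the RS-6-3 scheme), and the function names are made up for this example rather than taken from any Hadoop API.

```python
from fractions import Fraction


def replication_raw_per_logical(replicas: int) -> Fraction:
    """Raw bytes stored per logical byte when every block is fully copied."""
    return Fraction(replicas, 1)


def erasure_coding_raw_per_logical(data_units: int, parity_units: int) -> Fraction:
    """Raw bytes stored per logical byte for a Reed-Solomon (data, parity) layout."""
    return Fraction(data_units + parity_units, data_units)


# Assumption: 3x replication versus RS with 6 data units and 3 parity units.
replication_cost = replication_raw_per_logical(3)      # 3 raw bytes per logical byte
ec_cost = erasure_coding_raw_per_logical(6, 3)         # 3/2 raw bytes per logical byte

# Storage efficiency is the share of raw capacity that holds user data.
replication_efficiency = 1 / replication_cost          # 1/3
ec_efficiency = 1 / ec_cost                            # 2/3
improvement = ec_efficiency / replication_efficiency   # 2

print(f"replication: {replication_cost} raw bytes per logical byte")
print(f"erasure coding: {ec_cost} raw bytes per logical byte")
print(f"efficiency improvement: {improvement}x")
```

Under these assumptions, the same raw capacity holds twice as much user data, while the layout can still tolerate the loss of up to three units per block group.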

For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync edit log segments, protecting against rolling failures. Disk balancing within a DataNode is another important addition: it keeps a DataNode’s disks evenly utilized, which improves aggregate throughput and prevents lopsided utilization when disks are added or replaced. The HDFS team is currently driving the Ozone initiative, which lays the foundation for the next generation of HDFS storage architecture, in which data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support new use cases.
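To make the disk-balancing idea concrete, here is a toy sketch of the underlying arithmetic. It is not HDFS's actual planner: the function name, the disk names, and the "equal utilization ratio on every disk" target are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def bytes_to_move(disks: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Return, per disk, how many bytes to shed (positive) or receive (negative).

    `disks` maps a hypothetical disk name to (capacity_bytes, used_bytes).
    The target is the same used/capacity ratio on every disk of the node.
    """
    total_capacity = sum(capacity for capacity, _ in disks.values())
    total_used = sum(used for _, used in disks.values())
    target_ratio = total_used / total_capacity
    return {
        name: round(used - target_ratio * capacity)
        for name, (capacity, used) in disks.items()
    }


# Two well-used disks plus a freshly added empty one (made-up numbers).
plan = bytes_to_move({
    "disk1": (1000, 900),
    "disk2": (1000, 600),
    "disk3": (1000, 0),
})
print(plan)
```

In this example the node-wide utilization is 50%, so the two older disks shed data and the new disk absorbs it until all three sit at the same ratio.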

You’ll leave knowing how to upgrade smoothly from 2.x to 3.x and take advantage of these benefits.

Prerequisite knowledge

  • A basic understanding of Hadoop or related ecosystems

What you'll learn

  • Understand the value of upgrading to Hadoop 3.x, for both YARN and HDFS
  • Learn to upgrade Hadoop from 2.x to 3.x

Wangda Tan


Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both cloud and on-premises use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He’s also led efforts in the Hadoop YARN community on features such as resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on integrating OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.


Wei-Chiu Chuang


Wei-Chiu Chuang is a software engineer at Cloudera, where he’s responsible for the development of Cloudera’s storage systems, mostly the Hadoop Distributed File System (HDFS). He’s an Apache Hadoop committer and Project Management Committee member, in recognition of his contributions to the open source project. He’s also a cofounder of the Taiwan Data Engineering Association, a nonprofit organization promoting better data engineering technologies and applications in Taiwan. Wei-Chiu earned his PhD in computer science from Purdue University for his research in distributed systems and programming models.

