Sep 23–26, 2019

Apache Hadoop 3.x State of The Union and Upgrade Guidance

Wangda Tan (Cloudera), Arpit Agarwal (Hortonworks Inc.)
4:35pm–5:15pm Wednesday, September 25, 2019
Location: 1E 07/08

Who is this presentation for?

CIOs, infrastructure engineers, and SREs




Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN to build applications on top of Hadoop without repeatedly solving resource-management, isolation, and multi-tenancy problems themselves.

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

In this talk, we’ll start with the current status of Apache Hadoop 3.x – how it is used today in deployments large and small. We’ll then move on to the exciting present & future of Hadoop 3.x – features that are further strengthening Hadoop as the primary resource-management platform as well as the storage system for enterprise data-centers.

We’ll discuss the current status as well as the future promise of features and initiatives for both YARN and HDFS of Hadoop 3.x:

For YARN 3.x: powerful container placement and global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and FPGA scheduling and isolation; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without code changes; seamless application and service upgrades; scheduling features such as application priorities and intra-queue preemption across applications; and operational enhancements, including insights through Timeline Service v2, a new web UI, and better queue management.
HDFS 3.0 made erasure coding generally available, which roughly doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS also added support for multiple standby NameNodes for better availability.
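To see where the "doubles the storage efficiency" figure comes from, here is a back-of-the-envelope comparison of 3x replication against the RS-6-3 Reed-Solomon policy (one of the built-in HDFS erasure coding policies: 6 data blocks plus 3 parity blocks per stripe). The function names are ours, for illustration only:

```python
# Back-of-the-envelope storage overhead: 3x replication vs RS(6,3)
# erasure coding (6 data blocks + 3 parity blocks per stripe).

def replication_overhead(replicas=3):
    """Raw bytes stored per byte of user data under replication."""
    return float(replicas)

def ec_overhead(data_blocks=6, parity_blocks=3):
    """Raw bytes stored per byte of user data under erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead())                  # 3.0 (200% extra storage)
print(ec_overhead())                           # 1.5 (50% extra storage)
print(replication_overhead() / ec_overhead())  # 2.0: EC doubles efficiency
```

With the same durability goals, RS(6,3) stores 1.5 bytes per byte of user data versus 3 bytes under triple replication, hence the 2x efficiency gain cited above.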

For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync edit log segments, protecting against rolling failures.
Disk balancing within a DataNode is another important addition: it ensures disks within a DataNode are evenly utilized, which improves aggregate throughput and prevents lopsided utilization when new disks are added or replaced. The HDFS team is currently driving the Ozone initiative, which lays the foundation for the next generation of HDFS storage architecture: data blocks are organized into storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support new use cases.

Finally, since more and more users are planning to upgrade from 2.x to 3.x to get the benefits mentioned above, we will also briefly cover upgrade guidance from Hadoop 2.x to 3.x.

Prerequisite knowledge

Basic understanding of Hadoop or related ecosystems.

What you'll learn

Understand the value of upgrading to Hadoop 3.x, including the new features of YARN and HDFS, and learn how to upgrade Hadoop from 2.x to 3.x.

Wangda Tan


Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and an engineering manager of the computation platform team at Cloudera, where he manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led features such as resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Before joining Cloudera, he worked at Pivotal on integrating OpenMPI and GraphLab with Hadoop YARN. Before that, he worked at Alibaba cloud computing, where he participated in creating a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.


Arpit Agarwal

Hortonworks Inc.

Arpit Agarwal has been an active HDFS/Hadoop committer since 2013 and is a member of the Storage Engineering team at Cloudera.
