Apache Hadoop 3.x state of the union and upgrade guidance

Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1E 07/08

Data Engineering and Architecture

Secondary topics: Deep dive into specific tools, platforms, or frameworks

Who is this presentation for?

CIOs, infrastructure engineers, and SREs

Level

Intermediate

Description

Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multitenancy issues, etc. The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Wangda Tan and Wei-Chiu Chuang outline the current status of Apache Hadoop 3.x and how it’s used today in deployments large and small. They then dive into the present and future of Hadoop 3.x: features that further strengthen Hadoop as the primary resource-management platform and storage system for enterprise data centers.

They explore the current status and the future promise of features and initiatives for both YARN and HDFS in Hadoop 3.x. YARN 3.x offers powerful container placement; global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and field-programmable gate array (FPGA) scheduling and isolation; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without any changes; seamless application and service upgrades; scheduling features such as application priorities and intra-queue preemption across applications; and operational enhancements including insights through Timeline Service v2, a new web UI, and better queue management. On the HDFS side, erasure coding reached general availability (GA) in HDFS 3.0; it doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS also added support for multiple standby NameNodes for better availability.
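The "doubles the storage efficiency" claim for erasure coding can be checked with a little arithmetic. The sketch below assumes the comparison is between 3-way replication and a Reed-Solomon layout with 6 data cells and 3 parity cells per stripe; the abstract does not say which policy is meant, and the function name is only illustrative.

```python
from fractions import Fraction


def storage_efficiency(data_units: int, redundancy_units: int) -> Fraction:
    """Return the fraction of raw capacity that holds user data."""
    return Fraction(data_units, data_units + redundancy_units)


# Assumption: 3-way replication stores 1 copy of the data plus 2 extra replicas.
replication = storage_efficiency(1, 2)

# Assumption: Reed-Solomon (6, 3) stores 6 data cells plus 3 parity cells per stripe.
rs_6_3 = storage_efficiency(6, 3)

# Ratio of the two efficiencies: how much more user data fits in the same raw space.
improvement = rs_6_3 / replication

print(f"3x replication efficiency: {replication}")
print(f"RS(6,3) efficiency:        {rs_6_3}")
print(f"Improvement factor:        {improvement}")
```

Under these assumptions, replication uses one third of raw capacity for user data and the erasure-coded layout uses two thirds, which is where a factor of two would come from. Other policies give different ratios.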

For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync edit log segments to protect against rolling failures. Disk balancing within a DataNode is another important addition: it keeps the disks in a DataNode evenly utilized, which improves aggregate throughput and prevents lopsided utilization when disks are added or replaced. The HDFS team is currently driving the Ozone initiative, which lays the foundation for the next generation of HDFS storage architecture, in which data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support new use cases.

You’ll leave knowing how to upgrade painlessly from 2.x to 3.x and get all of these benefits.

Prerequisite knowledge

A basic understanding of Hadoop or related ecosystems

What you'll learn

Understand the value of upgrading to Hadoop 3.x, including YARN and HDFS
Learn to upgrade Hadoop from 2.x to 3.x

Wangda Tan

Cloudera

Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He’s also led efforts in the Hadoop YARN community on features such as resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Wei-Chiu Chuang

Cloudera

Wei-Chiu Chuang is a software engineer at Cloudera, where he’s responsible for the development of Cloudera’s storage systems, mostly the Hadoop Distributed File System (HDFS). He’s an Apache Hadoop committer and Project Management Committee member for his contributions to the open source project. He’s also a cofounder of the Taiwan Data Engineering Association, a nonprofit organization promoting better data engineering technologies and applications in Taiwan. Wei-Chiu earned his PhD in computer science from Purdue University for his research in distributed systems and programming models.
