Apache Hadoop 3.x State of The Union and Upgrade Guidance
Who is this presentation for?
CIO, Infrastructure Engineer, SRE
Prerequisite knowledge
Basic understanding of Hadoop or related ecosystems
What you'll learn
Apache Hadoop YARN is the modern Distributed Operating System for big data applications. It morphed the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
In this talk, we’ll start with the current status of Apache Hadoop 3.x – how it is used today in deployments large and small. We’ll then move on to the exciting present & future of Hadoop 3.x – features that are further strengthening Hadoop as the primary resource-management platform as well as the storage system for enterprise data-centers.
We’ll discuss the current status as well as the future promise of features and initiatives for both YARN and HDFS of Hadoop 3.x:
YARN 3.x offers powerful container placement and global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and FPGA scheduling and isolation; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without any changes; and seamless application and service upgrades. It also brings powerful scheduling features such as application priorities and intra-queue preemption across applications, plus operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
On the HDFS side, erasure coding reached GA in HDFS 3.0. It doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS also added support for multiple standby NameNodes for better availability.
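The "doubles the storage efficiency" figure can be checked with a little arithmetic. The sketch below assumes the comparison is between 3-way replication and a Reed-Solomon layout with 6 data cells and 3 parity cells; the abstract does not say which erasure coding policy it has in mind, and the function names are illustrative, not part of any Hadoop API.

```python
def raw_bytes_per_user_byte_replication(replicas: int) -> float:
    """Raw storage consumed per byte of user data under N-way replication."""
    return float(replicas)


def raw_bytes_per_user_byte_ec(data_units: int, parity_units: int) -> float:
    """Raw storage consumed per byte of user data under a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


# Assumption: 3-way replication versus Reed-Solomon with 6 data + 3 parity cells.
replication_cost = raw_bytes_per_user_byte_replication(3)  # 3 raw bytes per user byte
ec_cost = raw_bytes_per_user_byte_ec(6, 3)                 # 9/6 raw bytes per user byte

# How many times more user data fits in the same raw capacity.
efficiency_gain = replication_cost / ec_cost

print(f"replication: {replication_cost:.1f}x raw storage")
print(f"erasure coding: {ec_cost:.1f}x raw storage")
print(f"efficiency gain: {efficiency_gain:.1f}x")
```

Under these assumptions the same raw capacity holds twice as much user data, while the layout still tolerates the loss of any three cells in a stripe.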
For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync edit log segments to protect against rolling failures.
Disk balancing within a DataNode is another important addition. It ensures that disks in a DataNode are evenly utilized, which improves aggregate throughput and prevents lopsided utilization when disks are added or replaced. The HDFS team is currently driving the Ozone initiative, which lays the foundation for the next generation of HDFS storage architecture: data blocks are organized into Storage Containers for higher scale and better handling of small objects in HDFS. The Ozone project also includes an object store implementation to support new use cases.
Finally, since more and more users are planning to upgrade from 2.x to 3.x to get the benefits mentioned above, we will also briefly cover upgrade guidance from Hadoop 2.x to 3.x.
Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-prem use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led efforts in the Hadoop YARN community on features such as resource scheduling, GPU isolation, node labeling, and resource preemption. Before joining Cloudera, he worked at Pivotal on integrating OpenMPI/GraphLab with Hadoop YARN. Before that, he worked at Alibaba Cloud Computing, where he participated in creating a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.
Jitendra Pandey leads HDFS, Ozone, and HBase engineering at Hortonworks Inc. and has been contributing to the Hadoop ecosystem for more than 9 years. Jitendra is a committer and PMC member for Apache Hadoop. He is also a committer for the Apache Ambari and Apache Hive projects. Jitendra's contributions span various areas in Ozone, HDFS, vectorized query processing in Hive, and Hadoop security infrastructure. Prior to Hortonworks, Jitendra worked at Yahoo on Big Data infrastructure and applications.