Mar 15–18, 2020

It’s 2020 now: Apache Hadoop 3.x state of the union and upgrade guidance

Wangda Tan (Cloudera), Arpit Agarwal (Cloudera)
2:35pm–3:15pm Wednesday, March 18, 2020
Location: LL20A

Who is this presentation for?

Data engineers, data architects, developers

Level

Intermediate

Description

Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multitenancy issues, etc. The Hadoop distributed file system (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Wangda Tan and Arpit Agarwal cover the current status of Apache Hadoop 3.x—how it’s used today in deployments large and small—and the exciting present and future of Hadoop 3.x, features that further strengthen Hadoop as the primary resource-management platform as well as the storage system for enterprise data centers.

You’ll discover the current status and the future promise of features and initiatives for YARN and HDFS in Hadoop 3.x. YARN 3.x has powerful container placement; global scheduling; support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and field-programmable gate array (FPGA) scheduling and isolation; extreme scale with YARN federation; containerized apps on YARN; native support for long-running services (alongside applications) without any changes; seamless application and service upgrades; powerful scheduling features like application priorities and intraqueue preemption across applications; and operational enhancements including insights through Timeline Service v2, a new web UI, and better queue management. HDFS 3.0 also made erasure coding generally available (GA), which doubles the storage efficiency of data and reduces the cost of storage for enterprise use cases. HDFS also added support for multiple standby NameNodes for better availability.
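To make the "doubles the storage efficiency" claim concrete, here is a minimal arithmetic sketch. It assumes the comparison is between 3x replication and a Reed-Solomon layout with 6 data cells and 3 parity cells per stripe; the abstract itself does not name either scheme, so treat both as illustrative assumptions. The function name is hypothetical.

```python
def storage_efficiency(data_units: int, redundancy_units: int) -> float:
    """Fraction of raw capacity that holds user data (hypothetical helper)."""
    return data_units / (data_units + redundancy_units)

# Assumption: 3x replication, i.e. 1 original block plus 2 extra copies.
replication = storage_efficiency(1, 2)

# Assumption: Reed-Solomon with 6 data cells plus 3 parity cells per stripe.
erasure_coded = storage_efficiency(6, 3)

improvement = erasure_coded / replication
print(f"replication: {replication:.1%}, RS(6,3): {erasure_coded:.1%}, "
      f"ratio: {improvement:.1f}x")
# prints "replication: 33.3%, RS(6,3): 66.7%, ratio: 2.0x"
```

Under these assumptions, the same raw capacity holds twice as much user data, while the erasure-coded layout can still tolerate the loss of up to three cells per stripe.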

For better reliability of metadata and easier operations, JournalNodes have been enhanced to sync the edit log segments to protect against rolling failures. Disk balancing within a DataNode is another important addition: it ensures disks are evenly used in a DataNode, which improves aggregate throughput and prevents lopsided use when disks are added or replaced. The HDFS team is driving the Ozone initiative, which lays the foundation for the next generation of storage architecture for HDFS, where data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support new use cases.

And since more and more users are planning to upgrade from 2.x to 3.x to get all those benefits, you’ll get upgrade guidance from Hadoop 2.x to 3.x.

Prerequisite knowledge

  • Experience using and developing with Hadoop (and its friends like Spark, Flink, Hive, etc.)

What you'll learn

  • Understand what's new in Hadoop 3.x (in 2020) and what changes from 2.x to 3.x
  • Get upgrade guidance

Wangda Tan

Cloudera

Wangda Tan is a project management committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led efforts in the Hadoop YARN community on features like resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Arpit Agarwal

Cloudera

Arpit Agarwal is an engineer in the storage team at Cloudera and an active HDFS/Hadoop committer since 2013.
