YARN - Next Generation Hadoop Map-Reduce

Data: Hadoop
Location: C124

The Apache Hadoop Map-Reduce framework is clearly showing its age.

In particular, the Map-Reduce JobTracker needs a drastic overhaul to address several technical deficiencies in its memory consumption, threading model, and scalability/reliability/performance, given observed trends in cluster sizes and workloads. Periodically, we have done running repairs. However, lately these have come at an ever-growing cost, as evinced by the worryingly regular site-up issues we have seen in the past year. The architectural deficiencies, and the corrective measures, are both old and have been well understood since as far back as late 2007: https://issues.apache.org/jira/browse/MAPREDUCE-278.

The most pressing requirements for the next generation of the Map-Reduce framework are:

  • Reliability
  • Availability
  • Scalability – Clusters of 10,000 nodes and 200,000 cores
  • Backward Compatibility – Ensure customers’ Map-Reduce applications can run unchanged in the next version of the framework. This also implies forward compatibility.
  • Evolution – Ability for customers to control upgrades to the grid software stack.
  • Predictable Latency – A major customer concern.
  • Cluster utilization
  • Support for alternate programming paradigms to Map-Reduce
  • Support for limited, short-lived services

The fundamental idea of YARN is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library, tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
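
The division of labour described above can be illustrated with a toy model. This is only a sketch and not the YARN API: the class names mirror the three roles, but every method (`allocate`, `release`, `launch`, `run`), the notion of a fixed number of "slots" per node, and the assumption that a task finishes as soon as it is launched are hypothetical simplifications made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent (toy): tracks free slots and the tasks running on the node."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def launch(self, app_id, task):
        self.running.append((app_id, task))

    def finish(self, app_id, task):
        self.running.remove((app_id, task))


class ResourceManager:
    """Global arbiter (toy): hands out slots and knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def allocate(self, app_id, wanted):
        """Grant up to `wanted` slots; return one node name per granted slot."""
        grants = []
        for node in self.nodes.values():
            while node.free_slots > 0 and len(grants) < wanted:
                node.free_slots -= 1
                grants.append(node.name)
        return grants

    def release(self, node_name):
        self.nodes[node_name].free_slots += 1


class ApplicationMaster:
    """Per-application coordinator (toy): negotiates slots, then drives its own tasks."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.pending = list(tasks)
        self.done = []
        self.rm = rm

    def run(self):
        """Run all tasks; return the number of negotiation rounds that were needed."""
        rounds = 0
        while self.pending:
            grants = self.rm.allocate(self.app_id, len(self.pending))
            if not grants:
                raise RuntimeError("cluster has no free capacity")
            for node_name in grants:
                task = self.pending.pop(0)
                node = self.rm.nodes[node_name]
                node.launch(self.app_id, task)
                # Simplifying assumption: the task completes immediately.
                node.finish(self.app_id, task)
                self.rm.release(node_name)
                self.done.append((task, node_name))
            rounds += 1
        return rounds


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 1)])
    am = ApplicationMaster("app-1", ["t0", "t1", "t2", "t3", "t4"], rm)
    print("rounds:", am.run())
    print("placements:", am.done)
```

In this sketch the ResourceManager never inspects tasks, and the ApplicationMaster never decides which application gets capacity. That separation is the point of the design: scheduling logic specific to one framework lives in its ApplicationMaster, so paradigms other than Map-Reduce can share the same cluster.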

This talk will cover the design and architecture of YARN in more detail, including how it improves data processing with Hadoop Map-Reduce and how it allows other programming paradigms to run on Hadoop grids.

Arun Murthy

Hortonworks Inc.

Arun is the lead of the next generation MapReduce project in Apache Hadoop. Arun has been a full-time contributor to Apache Hadoop since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop’s MapReduce as a service for Yahoo!. Follow Arun on Twitter: @acmurthy.