
Why is my Hadoop job slow?

Bikas Saha (Hortonworks Inc)
14:05–14:45 Thursday, 2 June 2016
Hadoop internals & development
Location: Capital Suite 13
Level: Advanced
Average rating: 3.00 (6 ratings)

Prerequisite knowledge

Attendees should have a solid understanding of Hadoop and experience using it for data processing (whether via Spark, Hive, Pig, Oozie, MapReduce, etc.).

Description

Hadoop is used to run large-scale jobs that are subdivided into many tasks executed over multiple machines. These tasks have complex dependencies, and at scale there can be thousands of tasks running over thousands of machines, which makes it difficult to make sense of their performance. Add pipelines that chain these jobs into business workflows, and it's no wonder that slower-than-expected Hadoop jobs remain a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe some methodical approaches and present new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
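A typical first step in such an investigation, before digging into per-task traces, is simply finding the slow application among the thousands running on the cluster. The sketch below is a minimal, illustrative example (not part of the talk materials) that queries the YARN ResourceManager's Cluster Applications REST API to list the longest-running applications; the hostname, port, and 30-minute threshold are assumptions for a default, unsecured cluster.

import requests

RM_URL = "http://rm-host:8088"  # hypothetical ResourceManager address


def long_running_apps(min_minutes=30):
    """Return RUNNING YARN apps alive longer than min_minutes, longest first."""
    resp = requests.get(
        f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"}
    )
    resp.raise_for_status()
    # The RM returns {"apps": null} when nothing is running.
    apps = (resp.json().get("apps") or {}).get("app", [])
    slow = [a for a in apps if a["elapsedTime"] > min_minutes * 60 * 1000]
    # Sorting by elapsed time is a crude first filter before examining
    # per-task counters and container logs for the suspect application.
    return sorted(slow, key=lambda a: a["elapsedTime"], reverse=True)


if __name__ == "__main__":
    for app in long_running_apps():
        print(
            app["id"],
            app["applicationType"],
            app["name"],
            f"{app['elapsedTime'] // 60000} min",
            f"{app['progress']:.0f}%",
        )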


Bikas Saha

Hortonworks Inc

Bikas Saha has been working in the Apache Hadoop ecosystem since 2011, focusing on YARN and the Hadoop compute stack, and is a committer/PMC member of the Apache Hadoop and Tez projects. Bikas currently works on Apache Tez, a framework for building high-performance data processing applications natively on YARN, and was a key contributor to making Hadoop run natively on Windows. Prior to Hadoop, he worked extensively on the Dryad distributed data processing framework, which runs on some of the world's largest clusters as part of Microsoft's Bing infrastructure.