Hadoop runs large-scale jobs that are subdivided into many tasks executed across multiple machines. These tasks have complex dependencies, and at scale there can be thousands of tasks running on thousands of machines, which makes it difficult to make sense of their performance. Add pipelines that chain jobs into a logical business workflow as another layer of complexity, and it’s no wonder that Hadoop jobs running slower than expected remain a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe methodical approaches to the problem and present new tracing and tooling ideas that can help semi-automate parts of this difficult task.
Bikas Saha has been working in the Apache Hadoop ecosystem since 2011, focusing on YARN and the Hadoop compute stack, and is a committer and PMC member of the Apache Hadoop and Tez projects. Bikas is currently working on Apache Tez, a framework for building high-performance data processing applications natively on YARN. He has been a key contributor to making Hadoop run natively on Windows. Prior to Hadoop, he worked extensively on the Dryad distributed data processing framework, which runs on some of the world’s largest clusters as part of Microsoft’s Bing infrastructure.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O’Reilly Media and/or Cloudera.