The performance of batch processing systems such as YARN is generally measured by throughput: the amount of work (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much work as possible on each host, using all of the host's free resources. Because each node runs a complex mix of tasks and containers, the cluster's performance characteristics change dynamically. As a result, there is always a danger of overcommitting host memory, which can lead to extreme swapping, or thrashing. The impact of thrashing can be severe: it can reduce throughput instead of increasing it.
Sean Suchter and Shekhar Gupta explain how they used fine-grained performance data from many Hadoop clusters to build a model that predicts excessive swapping events. (To build this system, they combined hand-labeling of bad events with large-scale data processing using Hadoop, HBase, Spark, and IPython for experimentation.) Using five-second data from many production clusters running very different workloads, Sean and Shekhar trained a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of thrashing, allowing the hosts to keep providing high throughput. Sean and Shekhar discuss the methods they used and share novel findings about big data cluster performance.
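To make the idea of detecting thrashing from five-second samples concrete, here is a minimal sketch. It is not the trained model described in the session: it swaps in a simple threshold rule (sustained swap-in and swap-out activity over several consecutive samples) purely for illustration. The names `Sample`, `ThrashingDetector`, and `detect`, along with the threshold and window values, are hypothetical assumptions and do not come from the talk.

```python
from collections import deque
from dataclasses import dataclass

SAMPLE_INTERVAL_S = 5  # the abstract mentions five-second data


@dataclass
class Sample:
    """One per-host measurement interval (hypothetical schema)."""
    swap_in_pages: int   # pages swapped in during the interval
    swap_out_pages: int  # pages swapped out during the interval


class ThrashingDetector:
    """Illustrative threshold heuristic, NOT the speakers' learned model."""

    def __init__(self, rate_threshold=1000.0, consecutive=3):
        # Both values are made-up defaults chosen only for this example.
        self.rate_threshold = rate_threshold      # pages per second
        self.window = deque(maxlen=consecutive)   # recent per-sample verdicts

    def observe(self, sample):
        """Record one sample; return True if thrashing is suspected."""
        in_rate = sample.swap_in_pages / SAMPLE_INTERVAL_S
        out_rate = sample.swap_out_pages / SAMPLE_INTERVAL_S
        # Simultaneous heavy swap-in and swap-out suggests that pages are
        # being evicted and then needed again shortly afterwards.
        busy = in_rate > self.rate_threshold and out_rate > self.rate_threshold
        self.window.append(busy)
        # Flag only once every sample in a full window was busy.
        return len(self.window) == self.window.maxlen and all(self.window)


def detect(samples, **kwargs):
    """Run a fresh detector over a sequence of samples."""
    detector = ThrashingDetector(**kwargs)
    return [detector.observe(s) for s in samples]
```

With the defaults above, a host must show three consecutive busy samples (15 seconds) before it is flagged. A learned model like the one the speakers describe would presumably replace the fixed rule with a classifier trained on labeled events.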
Sean Suchter is the CTO and cofounder of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search, and managed the Yahoo Search Technology team, the first production user of Hadoop. He joined Yahoo through the acquisition of Inktomi. Sean holds a BS in engineering and applied science from Caltech.
Shekhar Gupta is a software engineer at Pepperdata. He holds a PhD from TU Delft, where he focused on using machine learning to improve and monitor the performance of distributed systems.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.