Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Big data for big data: Machine-learning models of Hadoop cluster behavior

Sean Suchter (Pepperdata), Shekhar Gupta (Pepperdata)
11:50am–12:30pm Wednesday, March 15, 2017
Hadoop platform and applications
Location: LL21 E/F
Level: Beginner

Who is this presentation for?

  • Data scientists and engineers

What you'll learn

  • Learn how to use machine learning to improve cluster performance


The performance of a batch processing system such as YARN is generally measured by its throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, using all of the free resources available on that host. Because each node runs a complex combination of different tasks and containers, the performance characteristics of the cluster change constantly. As a result, there is always a danger of overutilizing host memory, which can result in extreme swapping, or thrashing. The impact of thrashing can be severe: it can reduce throughput instead of increasing it.
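As a rough illustration of the throughput definition above, the following sketch counts the tasks that finish inside a time window and divides by the window length. The function name, the list of completion timestamps, and the choice of seconds as the unit are all assumptions made for this example; they are not taken from YARN or from the speakers' tooling.

```python
from typing import Iterable


def throughput(completion_times: Iterable[float],
               window_start: float,
               window_end: float) -> float:
    """Tasks completed per second in the half-open window [start, end).

    `completion_times` is a hypothetical list of task finish times
    (seconds since some reference point); it is illustrative only.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    # Count only tasks that finished inside the window.
    done = sum(1 for t in completion_times if window_start <= t < window_end)
    return done / (window_end - window_start)


# Made-up finish times: five tasks finish in the first minute, one after it.
finish_times = [2, 7, 7, 12, 30, 61]
tasks_per_second = throughput(finish_times, 0, 60)
```

Packing more tasks onto a host raises this number only as long as the tasks keep finishing; once a host starts thrashing, completions slow down and the same calculation reports a lower value.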

Sean Suchter and Shekhar Gupta explain how they used fine-grained performance data from many Hadoop clusters to build a model that predicts excessive swapping events. (To build this system, they combined hand-labeling of bad events with large-scale data processing using Hadoop, HBase, Spark, and IPython for experimentation.) Using five-second data from many production clusters running very different workloads, Sean and Shekhar have trained a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of thrashing, allowing the hosts to continuously provide high throughput. Sean and Shekhar discuss the methods they used and share novel findings about big data cluster performance.
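The abstract does not describe the model itself, so the sketch below is not the speakers' method. It substitutes a simple fixed-threshold rule for the trained model, purely to show what detecting onset "within seconds" from five-second samples could look like. The `Sample` fields, the thresholds, and the window length are all hypothetical, and the trace is synthetic.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Optional

SAMPLE_INTERVAL_S = 5  # five-second granularity, as stated in the abstract


@dataclass
class Sample:
    t: int                # seconds since the start of the trace
    mem_used_frac: float  # fraction of host memory in use, 0.0 to 1.0 (assumed metric)
    swap_out_pages: int   # pages swapped out in this interval (assumed metric)


def detect_thrashing_onset(samples: Iterable[Sample],
                           window: int = 3,
                           mem_threshold: float = 0.95,
                           swap_threshold: int = 1000) -> Optional[int]:
    """Return the timestamp at which thrashing is first flagged, or None.

    A host is flagged when `window` consecutive samples all show high
    memory use and heavy swap-out. This threshold rule is a stand-in
    for a learned classifier; the default values are illustrative.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    for s in samples:
        recent.append(s)
        if len(recent) < window:
            continue
        if all(r.mem_used_frac >= mem_threshold
               and r.swap_out_pages >= swap_threshold
               for r in recent):
            return s.t
    return None


# Synthetic trace: memory fills up and swap-out climbs from t=15 onward.
mem = [0.60, 0.70, 0.90, 0.96, 0.97, 0.98, 0.99, 0.99]
swap = [0, 0, 10, 1500, 2000, 3000, 4000, 5000]
trace = [Sample(i * SAMPLE_INTERVAL_S, m, s)
         for i, (m, s) in enumerate(zip(mem, swap))]
onset = detect_thrashing_onset(trace)  # flagged at t=25
```

With a three-sample window, the rule fires 10 seconds after the first bad sample (t=15). A trained model of the kind the abstract describes would replace the hard-coded thresholds with a decision learned from the hand-labeled events.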


Sean Suchter


Sean Suchter is the CTO and cofounder of Pepperdata. Previously, Sean was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search, and managed the Yahoo Search Technology team, the first production user of Hadoop. He joined Yahoo through the acquisition of Inktomi. Sean holds a BS in engineering and applied science from Caltech.


Shekhar Gupta


Shekhar Gupta is a software engineer at Pepperdata. He holds a PhD from TU Delft, where he focused on using machine learning to improve and monitor the performance of distributed systems.