Build resilient systems at scale
28–30 October 2015 • Amsterdam, The Netherlands

Finding bad apples early: Minimizing performance impact

Arun Kejariwal (Independent)
17:05–17:45 Thursday, 29/10/2015
Location: Emerald Room
Average rating: 3.07 (15 ratings)
Slides:   external link

Prerequisite Knowledge

The talk will be self-contained; no prerequisites are required.


The big data era is characterized by the ever-increasing velocity and volume of data. To store and analyze this ever-growing data, the operational footprint of data stores and Hadoop has also grown over time. (As per a recent report from IDC, spending on big data infrastructure is expected to reach $41.5 billion by 2018.) Such clusters comprise several thousand nodes, and their high performance is vital to delivering the best user experience and to team productivity.

The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack, so manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:

  • Robustness against anomalies (anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
  • A parameterized classification threshold: given the varying data characteristics of different services, no one model fits all
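The abstract does not specify the distance measure or the classification model. A minimal sketch of the general idea, assuming per-node metric vectors and a robust, median-based distance with a per-service threshold (the function name, metric layout, and threshold value are all hypothetical):

```python
import numpy as np

def find_slow_nodes(metrics, threshold=5.0):
    """Flag nodes whose metric profile deviates from the cluster norm.

    metrics: 2-D array, one row per node, one column per metric
             (e.g. hourly medians of CPU wait and disk I/O latency).
    threshold: parameterized per service, since no one model fits all.
    """
    # Use the median rather than the mean so that transient anomalies
    # (e.g. an ad-hoc heavyweight Hadoop job) do not skew the baseline.
    center = np.median(metrics, axis=0)
    # Median absolute deviation (MAD) as a robust scale estimate.
    mad = np.median(np.abs(metrics - center), axis=0)
    mad[mad == 0] = 1e-9  # avoid division by zero on constant metrics
    # Each node's distance: worst robust z-score across its metrics.
    distance = np.max(np.abs(metrics - center) / mad, axis=1)
    return np.where(distance > threshold)[0]

# Nine healthy nodes plus one (index 9) with much higher I/O latency.
cluster = np.array([
    [1.0, 2.1], [1.1, 2.0], [0.9, 1.9], [1.0, 2.0], [1.2, 2.2],
    [0.95, 2.05], [1.05, 1.95], [1.0, 2.0], [1.1, 2.1], [1.0, 9.0],
])
print(find_slow_nodes(cluster))  # [9]
```

Because both the center and the scale are medians, a single outlying node barely moves the baseline, which is what makes this family of distances robust against anomalies.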

The proposed technique works well with both hourly and daily data, and has been in production use by multiple services. This has not only eliminated manual investigation effort, but has also mitigated the impact of slow nodes, which previously went undetected for weeks or months.

We will walk the audience through how the techniques are being used with real data.


Arun Kejariwal


Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.
