Statistical Learning-based Automatic Anomaly Detection @Twitter

Operations
Location: 211 Level: Intermediate
Performance and high availability are among the increasingly important drivers of user retention for web services such as social networks and web search. Exogenous and endogenous factors often give rise to anomalies, which makes it challenging to maintain high availability while also delivering high performance. Because a service-oriented architecture (SOA) typically comprises a large number of services, each exposing a large set of metrics, detecting anomalies by visual inspection is not practical. Furthermore, automatic detection of anomalies is non-trivial for the following reasons:

  • Organic growth and other factors induce an underlying trend in the time series observed in production at Twitter. As a result, using a static threshold to classify data points in a given time series as anomalous produces a large number of false positives.
  • Given the social and global nature of Twitter, the time series observed in production often exhibit seasonality. Here too, using a static threshold to classify data points as anomalous produces a large number of false positives.
  • The traditional metrics used in anomaly detection algorithms, mean and variance, are themselves distorted by anomalies. This can result in a high false positive rate.
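
To make the last point concrete, here is a minimal sketch using only the Python standard library. The sample values and the helper name `mad` are made up for illustration and are not taken from the talk; the sketch simply compares how mean and standard deviation react to a single spike versus how median and median absolute deviation react.

```python
import statistics


def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Hypothetical metric samples; the second series ends in a single spike.
clean = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
with_spike = clean[:-1] + [100]

for name, data in (("clean", clean), ("with spike", with_spike)):
    print(
        f"{name:>10}: mean={statistics.mean(data):.2f} "
        f"stdev={statistics.stdev(data):.2f} "
        f"median={statistics.median(data)} MAD={mad(data)}"
    )
```

On this sample the mean and standard deviation shift markedly once the spike is included, while the median and MAD stay where they were, so a threshold built on the latter pair is not dragged along by the anomaly it is meant to catch.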

To this end, at Twitter, we developed novel statistical techniques for automatically detecting anomalies in cloud infrastructure data. Specifically, the techniques employ statistical learning to detect anomalies in both application and system metrics.

1. We employ time series decomposition to filter the trend and seasonal components of the time series.

2. We use robust statistical metrics – median and median absolute deviation (MAD) – to accurately detect anomalies, even in the presence of seasonal spikes.
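
The two steps above can be sketched roughly as follows. This is an illustrative simplification, not the implementation deployed at Twitter: the function name `detect_anomalies`, the threshold of 3.0, and the synthetic series are all assumptions. In particular, the trend is crudely taken to be the overall median of the series, which only suits a window with negligible growth, and the seasonal component is taken to be the median of each phase within the period. The constant 1.4826 is the usual factor that makes MAD comparable to a standard deviation for normally distributed data.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def detect_anomalies(series, period, threshold=3.0):
    """Return indices whose residual is far from the median in MAD units.

    Assumptions (illustrative only): flat trend estimated by the series
    median, seasonality estimated by per-phase medians.
    """
    # Step 1a: remove the (assumed flat) trend.
    trend = statistics.median(series)
    detrended = [x - trend for x in series]

    # Step 1b: remove the seasonal component, one median per phase.
    seasonal = [statistics.median(detrended[p::period]) for p in range(period)]
    residuals = [d - seasonal[i % period] for i, d in enumerate(detrended)]

    # Step 2: robust score of each residual using median and MAD.
    center = statistics.median(residuals)
    scale = 1.4826 * mad(residuals)
    if scale == 0:
        return []  # residuals are (almost) constant; nothing to score against
    return [
        i for i, r in enumerate(residuals)
        if abs(r - center) / scale > threshold
    ]


# Synthetic hourly metric: one week, daily seasonality, small deterministic
# jitter, and one injected spike at index 100 (all values are made up).
period = 24
series = [
    100.0
    + 20.0 * math.sin(2 * math.pi * (i % period) / period)
    + ((i * 7) % 5 - 2) * 0.5
    for i in range(period * 7)
]
series[100] += 50.0

print(detect_anomalies(series, period))
```

Because the seasonal pattern is subtracted before scoring, the regular daily peaks in this synthetic series do not count against the threshold; only the deviation from the expected value at each phase does.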

The techniques we present were evaluated on a wide variety of time series (system and application metrics obtained from production, as well as stock data) and have been deployed in production at Twitter. We demonstrate the efficacy of the proposed techniques using production data.

The proposed talk is complementary to the talks presented at Velocity London’13 on Anomaly Detection – by Jon from Etsy and by Toufic from Metafor.

Arun Kejariwal

MZ

@arun_kejariwal is currently a Capacity and Performance Engineer at Twitter, where he works on research and development of novel techniques to improve the accuracy of capacity models and demand forecasts. Prior to joining Twitter, @arun_kejariwal worked on research and development of practical and statistically rigorous methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques developed have been published in peer-reviewed international conferences and journals.

@arun_kejariwal received his Bachelor’s degree in EE from IIT Delhi and doctorate in CS from UCI.