Build resilient systems at scale
May 27–29, 2015 • Santa Clara, CA

Stream processing and anomaly detection @Twitter

Arun Kejariwal (Machine Zone), Sailesh Mittal (Twitter), Karthik Ramasamy (Streamlio)
1:45pm–2:25pm Thursday, 05/28/2015
Location: Ballroom CD
Average rating: 2.95 (20 ratings)

Prerequisite Knowledge

The talk is self-contained and no prior knowledge is required.

Description

Big data plays a vital role in every sphere of business, for example in surfacing personalized content in timelines to provide a highly available and performant service to the end user. This rests, in part, on the availability of high-fidelity data. However, exogenous and/or endogenous factors often give rise to anomalies. At web scale, with a large number of services and a large set of metrics per service, visual detection of anomalies is not pragmatic. Furthermore, automatic detection of anomalies is non-trivial for the following reasons:

  • There’s often a time skew between the time series of the different metrics (even for metrics of the same service).
  • Given the social and global nature of Twitter, the time series observed in production often exhibit seasonality. Consequently, using a static threshold to classify data points in a given time series as anomalous results in a large number of false positives (see the sketch after this list).
  • The number of time series available for analysis may change at run time. For example, in Storm the number of instances for each component may change dynamically.
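
For concreteness, here is a minimal sketch of the seasonality point above (synthetic data and numpy only; this is not the method presented in the talk, nor the code of the open-sourced package): a fixed alert threshold fires on every daily peak of a seasonal series, whereas thresholding the residual after removing the per-phase median flags essentially only the injected spike.

    import numpy as np

    rng = np.random.default_rng(42)
    n, period = 7 * 144, 144                  # one week of 10-minute samples, daily cycle
    t = np.arange(n)
    series = 100 + 40 * np.sin(2 * np.pi * t / period) + rng.normal(0, 5, n)
    series[1000] += 60                        # inject one true anomaly

    # Static threshold: a fixed alert level fires on every daily peak.
    static_flags = series > 130.0

    # Seasonality-aware: remove the per-phase median, then apply a robust
    # (MAD-based) threshold to the residual.
    phase_median = np.array([np.median(series[p::period]) for p in range(period)])
    residual = series - phase_median[t % period]
    mad = np.median(np.abs(residual - np.median(residual)))
    seasonal_flags = np.abs(residual) > 3 * 1.4826 * mad

    print("static threshold flags:", static_flags.sum())    # hundreds of daily-peak false positives
    print("seasonality-aware flags:", seasonal_flags.sum())  # a handful, including the injected spike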

To this end, at Twitter we developed novel statistical techniques for automatically detecting anomalies. In January 2015, we open-sourced these techniques as a standalone R package. Since then, we have extended them to leverage multiple time series to minimize the false positive rate. Specifically:

1. We exploit correlations between metrics – for example, the metrics of different components in a Storm topology.

2. We address the potential skew between different time series via interval intersection and/or convolution analysis (a sketch of lag estimation via cross-correlation follows this list).
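
As a rough illustration of point 2 (a sketch with synthetic metrics and numpy; not the implementation used at Twitter), the skew between two correlated series can be estimated by sliding one against the other and picking the offset that maximizes their correlation; once aligned, the correlation between the metrics (point 1) can be exploited jointly.

    import numpy as np

    rng = np.random.default_rng(0)
    n, true_lag = 2000, 17
    base = rng.normal(0, 1, n)                               # shared underlying behavior
    metric_a = base + rng.normal(0, 0.3, n)
    metric_b = np.roll(base, true_lag) + rng.normal(0, 0.3, n)   # same signal, skewed in time

    def estimate_lag(x, y, max_lag=100):
        """Return the lag k (in samples) by which y trails x, chosen to maximize correlation."""
        m = len(x)
        lags = list(range(-max_lag, max_lag + 1))
        scores = [np.corrcoef(x[max(0, -k): m - max(0, k)],
                              y[max(0, k): m - max(0, -k)])[0, 1] for k in lags]
        return lags[int(np.argmax(scores))]

    lag = estimate_lag(metric_a, metric_b)
    aligned_b = np.roll(metric_b, -lag)                      # undo the skew before joint analysis
    print("estimated skew:", lag)                            # recovers the injected lag
    print("correlation after alignment:", round(np.corrcoef(metric_a, aligned_b)[0, 1], 3))

The lag search above amounts to locating the peak of the cross-correlation between the two series (a convolution of one series with the time-reversed other), which is where the convolution-style analysis mentioned in point 2 comes in.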

The techniques we shall present were evaluated with a wide variety of time series, including system and application metrics obtained from production.

The proposed talk is complementary to the talk I gave at Velocity in November 2014.


Arun Kejariwal

Machine Zone

Arun Kejariwal (@arun_kejariwal) is a software engineer at Twitter, where he works on research and development of novel techniques for time series analysis. Prior to joining Twitter, Arun worked on research and development of practical and statistically rigorous methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been published in peer-reviewed international conferences and journals.

Arun received his Bachelor’s degree in EE from IIT Delhi and doctorate in CS from UCI.

Sailesh Mittal

Twitter

TBD


Karthik Ramasamy

Streamlio

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.


Comments

Karthik Ramasamy
06/30/2015 1:40am PDT

@sri sod – Storm bolts do not provide an inbuilt mechanism for state. Hence, some kind of cache or stable storage is needed. The type of cache (local or distributed) depends on the fault-tolerance level you need.

- Distributed cache – provides more fault tolerance when a bolt is relocated.

- Local cache – accumulated state will be lost when the machine fails and the bolt is relocated to a new machine.

However, neither of them guarantees exactly-once semantics.
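
For illustration only, the sketch below contrasts the two options for a per-metric running statistic. It is hypothetical Python, not Storm's API; the class names and key layout are made up, and the distributed variant assumes a reachable Redis instance.

    # Hypothetical sketch (not Storm's API): the same per-metric running state kept
    # either in a local in-process dict or in an external store such as Redis.
    # The local version is fast but its state is lost if the worker is relocated;
    # the external version survives relocation at the cost of a network round trip.
    # Neither gives exactly-once semantics: a replayed tuple updates the state twice.

    class LocalState:
        """In-process cache: lost when the machine fails or the bolt moves."""
        def __init__(self):
            self._counts = {}

        def add(self, metric, value):
            total, n = self._counts.get(metric, (0.0, 0))
            self._counts[metric] = (total + value, n + 1)

        def mean(self, metric):
            total, n = self._counts.get(metric, (0.0, 1))
            return total / n


    class RedisState:
        """External cache: survives bolt relocation (assumes a reachable Redis)."""
        def __init__(self, host="localhost"):
            import redis                      # third-party client, pip install redis
            self._r = redis.Redis(host=host)

        def add(self, metric, value):
            self._r.incrbyfloat(metric + ":total", value)
            self._r.incr(metric + ":n")

        def mean(self, metric):
            total = float(self._r.get(metric + ":total") or 0.0)
            n = int(self._r.get(metric + ":n") or 1)
            return total / n

A bolt's per-tuple path would call add() and read mean() when scoring a point; swapping LocalState for RedisState changes the fault-tolerance tradeoff without changing the processing logic, but, as noted above, a replayed tuple still updates the state twice in either case.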

sri sod
06/08/2015 6:30am PDT

This question is for Karthik Ramasamy. I am interested in knowing the caching mechanism you adopted for analyzing the incoming time series in Storm bolts. Did you employ an external distributed cache or a local cache?

Appreciate your