Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Learning the relationships between time series metrics at scale; or, Why you can never find a taxi in the rain

Ira Cohen (Anodot)
9:05–9:30 Tuesday, 23 May 2017
Hardcore Data Science
Location: London Suite 2/3
Secondary topics: IoT, Streaming
Level: Intermediate
Average rating: ***** (5.00, 4 ratings)

To gain insights from large-scale time series metrics and use them as the basis for accurate predictions, root cause diagnosis, and other tasks, it’s important to discover the relationships among the metrics (i.e., the correlations between them). For example, if you need to predict how much revenue an ecommerce site will generate this quarter, one very rough method is to use the previous quarter’s revenue as a guide, but that ignores other relevant signals, such as how much traffic the site received this quarter, its bounce rate, or other metrics that may be far better predictors.

However, to understand which metrics can serve as predictors (or support other tasks), one must understand which metrics are related to each other and how. For a small-scale operation, these relationships can be defined manually. For certain types of metrics, such as IT metrics, tools such as configuration management databases (CMDBs) may automate some of the discovery of the relationships between them. But to incorporate metrics beyond IT (e.g., application metrics or business metrics like revenue) at the vast scale most digital businesses require, machine learning tools are needed.

Ira Cohen shares key machine-learning methods for correlating metrics at scale, without having to do any manual configuration. Implementing these methods at scale can be computationally expensive, so Ira suggests methods for reducing the computational resources needed. (In particular, Ira explains how to efficiently scale the similarity and clustering methods.) And since correlation does not necessarily equal causation, Ira also covers ways to identify causality.

Topics include:

  • Abnormal similarities: If certain metrics tend to behave anomalously at the same time or at similar intervals, they may be related. For example, if website latency goes up and transaction time goes up while revenue drops, it’s possible that these three metrics are related to each other. Similarly, if taxi availability drops to zero every time the weather deteriorates, those two metrics may be related. Ira explains what to look for in abnormal similarities and what types of algorithms (such as clustering algorithms) can be used to identify these relationships.
  • Metadata similarities: Each metric has metadata associated with it, describing what is measured, where, and how. When collecting many metrics, similarities in their metadata properties can be an extremely valuable way to identify related or correlated metrics. Ira shares algorithms for discovering similarities in metadata of millions to billions of metrics.
  • Normal behavior similarities: Machine-learning algorithms can be used to compare the shapes and behavior of metrics when they are behaving normally. While it seems straightforward to use standard correlation algorithms for this (such as the Pearson correlation coefficient), off-the-shelf algorithms can generate many false positives. Ira discusses techniques to neutralize these false positives and generate usable results.
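The abnormal-similarity idea above can be sketched as a toy example: record the time buckets in which each metric was flagged anomalous, then group metrics whose anomaly windows overlap. The greedy single-link grouping, the 0.5 threshold, and the metric names below are illustrative assumptions, not the algorithms from the talk.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of anomalous time buckets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_anomalies(anomalies, threshold=0.5):
    """Greedy single-link grouping: a metric joins a cluster if its
    anomaly windows overlap enough with any existing member's."""
    clusters = []
    for name, windows in anomalies.items():
        for cluster in clusters:
            if any(jaccard(windows, anomalies[m]) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hour buckets in which each (hypothetical) metric was flagged anomalous:
anomalies = {
    "site_latency":     {3, 7, 12, 20},
    "transaction_time": {3, 7, 12, 21},
    "revenue":          {3, 7, 12},
    "cpu_temp":         {5, 15},
}
print(cluster_by_anomalies(anomalies))
```

Here the three business-facing metrics share most of their anomaly windows and land in one cluster, while the unrelated metric stays on its own.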
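For metadata similarity at the scale of millions to billions of metrics, comparing all pairs directly is infeasible; sketching techniques such as MinHash estimate the Jaccard similarity of metadata token sets from small fixed-size signatures. A minimal sketch, assuming a simple tokenization of each metric's metadata (the token sets and hash scheme are illustrative, not the talk's implementation):

```python
import hashlib

def minhash_signature(tokens, num_hashes=32):
    """One minimum per seeded hash function; the fraction of equal
    positions between two signatures estimates Jaccard similarity."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical metadata token sets for three metrics:
a = minhash_signature({"eu-west", "checkout", "latency", "p99"})
b = minhash_signature({"eu-west", "checkout", "latency", "p95"})
c = minhash_signature({"us-east", "billing", "errors"})
print(estimated_jaccard(a, b), estimated_jaccard(a, c))
```

In practice, signatures like these feed a locality-sensitive hashing index so that only candidate pairs with similar signatures are ever compared.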
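One common way Pearson correlation produces the false positives mentioned above is a shared trend: two unrelated metrics that both grow over time correlate strongly even though their fluctuations are independent. Differencing the series before correlating is one standard remedy; it is shown here as an illustrative fix, not necessarily the technique from the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def diff(x):
    """First differences remove a linear trend."""
    return [b - a for a, b in zip(x, x[1:])]

# Two unrelated synthetic metrics that both trend upward:
x = [i + (1 if i % 2 else -1) for i in range(50)]
y = [2 * i + (1 if i % 3 == 0 else -1) for i in range(50)]
print(round(pearson(x, y), 3))              # high: the shared trend dominates
print(round(pearson(diff(x), diff(y)), 3))  # near zero once the trend is removed
```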

Ira Cohen


Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Comments on this page are now closed.


26/05/2017 9:24 BST

Hello Ira, do you plan to share the slides?