7–9 November 2016: Conference & Tutorials
9–10 November 2016: Training
Amsterdam, The Netherlands

Data, data, everywhere, not an insight to take action upon

Arun Kejariwal (Independent)
14:40–15:20 Tuesday, 8/11/2016
Metrics/monitoring Cloud, DevOps G102/103 Audience level: Intermediate
Average rating: **...
(2.78, 9 ratings)

What you'll learn

  • Learn techniques for analyzing data in the presence of "holes" in the time series


The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.

A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):

  • Packets being dropped due to unresponsive downstream services
  • A network hiccup
  • Transient hardware or software failure
  • An issue with the data collection service

“Holes” in the time series on data analysis can potentially skew the analysis of data. This in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of “holes” in the time series, highlighting how missing data impacts common data analysis such as anomaly detection and forecasting, discussing the implications of missing data on time series of different granularities, such as minutely and hourly, and exploring a gamut of techniques that can be used to address the missing data issue (e.g., approximate the data using interpolation, regression, ensemble methods, etc.). Arun then walks you through how the techniques can be leveraged using real data.

Photo of Arun Kejariwal

Arun Kejariwal


Arun Kejariwal is an independent lead engineer. Previously, he was he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install-and-click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns, and his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection; and he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.