Build & maintain complex distributed systems
17–18 October 2017: Training
18–20 October 2017: Tutorials & Conference
London, UK

In-Person Training
Data science for effective operations

Heinrich Hartmann (Circonus)
Tuesday, 17 October & Wednesday, 18 October, 9:00–17:00
Location: Hilton Meeting Room 15–17
Level: Intermediate

Participants should plan to attend both days of this 2-day training course. Platinum and Training passes do not include access to tutorials on Wednesday.

Gathering telemetry data is key to operating reliable distributed systems at scale. Heinrich Hartmann explores a wide range of data science and analysis methods, both theoretical and practical, that can make you more effective at operations tasks.

What you'll learn, and how you can apply it

  • Learn how to interpret metrics and graphs presented by monitoring tools
  • Gain the mathematical background to reason about telemetry data and aggregation
  • Understand what specific metrics mean
  • Explore advanced topics like forecasting and anomaly detection

This training is for you because...

  • You're an SRE, operations engineer, sysadmin, or developer who wants to become effective at operations.

Prerequisites:

  • Experience using a monitoring system
  • A working knowledge of Python (if you are not familiar with the language, take a few minutes beforehand to review its basic data and control structures)

Hardware and/or installation requirements:

  • Pen and paper
  • A Linux-based laptop with Docker and Jupyter Notebook installed (run docker pull jupyter/datascience-notebook before the conference, or install Anaconda, which includes Jupyter Notebook)

Outline

Day 1

Descriptive statistics

  • Visualizations
  • Summary statistics (mean, stddev, median, percentiles, IQR)
  • Robustness and mergeability (desirable properties for ops applications)
  • Histograms
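As a rough illustration of the robustness and mergeability ideas above, here is a minimal sketch using only Python's statistics module. The latency values, host names, and the helper functions summarize and merged_mean are hypothetical, chosen for illustration rather than taken from the course material.

```python
import statistics

def summarize(samples):
    """Return the summary statistics named in the outline for a list of numbers."""
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples),
        "median": statistics.median(samples),
        "iqr": q3 - q1,  # interquartile range
    }

# Hypothetical request latencies in ms; the last value is a single outlier.
latencies = [12, 14, 13, 15, 12, 16, 14, 13, 15, 950]
summary = summarize(latencies)
print(summary)  # the outlier drags the mean far up; median and IQR barely move

def merged_mean(parts):
    """Combine per-host (count, sum) pairs into an exact fleet-wide mean."""
    total_count = sum(count for count, _ in parts)
    total_sum = sum(total for _, total in parts)
    return total_sum / total_count

host_a = [10, 11, 12]
host_b = [30, 40, 50, 60, 70]
fleet_mean = merged_mean([(len(host_a), sum(host_a)), (len(host_b), sum(host_b))])

# Medians do not merge: the median of per-host medians is not the fleet median.
median_of_medians = statistics.median(
    [statistics.median(host_a), statistics.median(host_b)]
)
true_median = statistics.median(host_a + host_b)
print(fleet_mean, median_of_medians, true_median)
```

The mean is mergeable because each host only needs to report a count and a sum. The median is not mergeable, which is one reason monitoring systems often ship histograms instead of precomputed quantiles.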

Time series analysis

  • Regressions
  • Filters and exponential smoothing
  • Approaches to anomaly detection
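One simple way to combine exponential smoothing with anomaly detection is to treat the smoothed level as a baseline and flag points that stray too far from it. The sketch below is an assumption-laden illustration, not the method taught in the course. The smoothing factor alpha, the absolute threshold, and the request counts are all hypothetical.

```python
def ewma(series, alpha):
    """Exponentially weighted moving average; alpha in (0, 1] weights new points."""
    smoothed = []
    level = series[0]
    for x in series:
        level = alpha * x + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

def detect_anomalies(series, alpha=0.3, threshold=10.0):
    """Flag indices whose value deviates from the smoothed baseline by more than threshold."""
    anomalies = []
    level = series[0]
    for i, x in enumerate(series[1:], start=1):
        if abs(x - level) > threshold:
            anomalies.append(i)
            continue  # keep the outlier out of the baseline
        level = alpha * x + (1 - alpha) * level
    return anomalies

# Hypothetical per-minute request counts with one spike at index 5.
requests = [20, 21, 19, 20, 22, 80, 21, 20]
print(ewma(requests, alpha=0.3))
print(detect_anomalies(requests))
```

Skipping flagged points when updating the baseline stops a single spike from dragging the level up and triggering follow-on alerts. The trade-off is that a genuine, permanent level shift would keep being flagged, so a real detector needs some way to adapt.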

Metrics: The good, the bad, the ugly

  • What is monitoring?
  • How to measure system properties properly (event data, state accounting, durations)
  • Problems with CPU utilization metrics
  • How to monitor APIs (p99 across a fleet of containers)
  • How to deal with ephemeral metrics
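The "p99 across a fleet of containers" item hides a common pitfall: averaging per-container percentiles does not give the fleet percentile. The sketch below uses hypothetical latencies for two containers and a nearest-rank percentile definition, which is an assumption (other percentile definitions exist). It also merges simple value-to-count histograms, a simplified stand-in for the bucketed histograms real monitoring systems use.

```python
import math
from collections import Counter

def percentile(samples, p):
    """Nearest-rank percentile of raw samples."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def histogram_percentile(hist, p):
    """Nearest-rank percentile from a Counter mapping value -> count."""
    rank = max(math.ceil(p * sum(hist.values()) / 100), 1)
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")

# Hypothetical latencies (ms): a fast container and a slow one serving fewer requests.
container_a = [10] * 99 + [500]
container_b = [200] * 10

# Wrong: averaging the per-container p99 values.
avg_p99 = (percentile(container_a, 99) + percentile(container_b, 99)) / 2

# Better: merge histograms (counts add up), then compute the percentile once.
fleet_hist = Counter(container_a) + Counter(container_b)
fleet_p99 = histogram_percentile(fleet_hist, 99)
print(avg_p99, fleet_p99)
```

Here the averaged value is about half the true fleet p99, because averaging ignores how many requests each container actually served.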

Day 2

Tools for data analysis

  • Python, Jupyter, and NumPy
  • Command-line tools (csvkit, feedgnuplot)

Data analysis exercises

  • Implement aggregation methods
  • Calculate accurate accounting statistics from exported monitoring data
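As an example of the kind of accounting exercise listed above, the sketch below computes total requests and an average rate from an exported cumulative counter. The export format (a list of (timestamp, value) pairs), the reset-handling rule, and the sample data are assumptions for illustration only.

```python
def counter_increase(samples):
    """Total increase of a cumulative counter given (timestamp, value) samples.

    A drop in value is treated as a counter reset (e.g. a process restart):
    we assume the counter restarted at 0, so the new value is the increase.
    Increments between the last pre-reset sample and the reset are lost.
    """
    total = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

def average_rate(samples):
    """Average per-second rate over the window covered by the samples."""
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration

# Hypothetical export: (unix seconds, cumulative request count),
# with a counter reset between t=120 and t=180.
exported = [(0, 100), (60, 160), (120, 250), (180, 30), (240, 90)]
print(counter_increase(exported))        # 60 + 90 + 30 + 60 = 240
print(exported[-1][1] - exported[0][1])  # naive last-minus-first: -10
print(average_rate(exported))
```

The naive "last value minus first value" approach silently produces a negative total once a reset occurs. Summing per-interval deltas avoids that, at the cost of undercounting whatever happened just before the reset.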

Monitoring tools

  • StatsD
  • Graphite/Grafana
  • Circonus

Monitoring tools exercises

  • Visualizing data in various ways
  • Data aggregation (percentiles)
  • Time series forecasting
  • Filtering
  • Anomaly detection
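A minimal time series forecasting sketch fits an ordinary least-squares line and extrapolates it, for example to estimate when a disk fills up. The disk-usage figures, the 100% threshold, and the function names are hypothetical. A straight line is also the simplest possible model; real usage data is rarely this clean.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def time_to_threshold(xs, ys, threshold):
    """Extrapolate the fitted line to the x where it reaches threshold (None if not rising)."""
    intercept, slope = linear_fit(xs, ys)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope

# Hypothetical hourly disk usage in percent.
hours = [0, 1, 2, 3, 4]
disk_used_pct = [50, 52, 54, 56, 58]
print(linear_fit(hours, disk_used_pct))              # (intercept, slope)
print(time_to_threshold(hours, disk_used_pct, 100))  # hours until 100% full
```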

About your instructor

Heinrich Hartmann is the lead data scientist at Circonus, where he is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. Previously, he worked as a researcher at the University of Oxford. Heinrich holds a PhD in mathematics from the University of Bonn.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package.
