Mean Time to Sleep: Quantifying the On-Call Experience

Ryan Frantz (Etsy), Laurie Denness (Bloomberg LP)
Operations
Mission City Ballroom B4
Average rating: 4.70 (27 ratings)

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

In this presentation, we’ll briefly discuss our motivations for quantifying the on-call experience, including:

  • Feeling like the on-call rotation was rough, but not being able to clarify why
  • Sensing we’d had repeat service failures that could be mitigated but not being able to confirm it
  • Needing to demonstrate to those not on-call why it is important to put hosts in downtime before performing work
  • Waking up to phantom pages because we expected alerts to fire

We’ll highlight the methods and tools we’ve used to gather and present data such as:

  • Automatically querying Nagios data to gather metrics on alerts, helping us define baselines for their frequency
  • Reporting on the number of alert events and their severity, the top alerting events, and the actions taken to resolve events, allowing us to visualize the distribution of alerts and provide a starting point to reduce alert fatigue (image: Alert Summary)
  • Graphing the relationship between an alert, an engineer waking, and an engineer returning to bed to calculate the Mean Time to Sleep (MTTS) and possibly understand the impact on productivity while on call (read Nagios, Sleep Data, and You on Code as Craft)
  • Generating weekly on-call hand-off reports that highlight the past week’s events and attempt to forecast potential issues for common types of alerts, so that engineers about to go on call have an idea of what to prepare for and can possibly mitigate those issues proactively
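
As a rough illustration of the MTTS idea, here is a minimal sketch in Python. It assumes MTTS means the average interval between an alert firing and the engineer being back asleep; the abstract does not give an exact definition, so that reading is an assumption, and the function name, input shape, and sample timestamps are hypothetical rather than taken from Etsy's tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_sleep(events):
    """Average the alert-to-asleep interval over a list of pages.

    `events` is a list of (alert_time, back_asleep_time) datetime pairs.
    This input shape is an assumption made for illustration only.
    """
    durations = [
        (asleep - alert).total_seconds()
        for alert, asleep in events
        if asleep >= alert  # skip pairs with inconsistent timestamps
    ]
    if not durations:
        return None  # no usable pages, so no meaningful average
    return timedelta(seconds=mean(durations))


# Hypothetical sample data: two overnight pages.
pages = [
    (datetime(2014, 6, 1, 2, 10), datetime(2014, 6, 1, 2, 40)),  # 30 minutes
    (datetime(2014, 6, 2, 3, 0), datetime(2014, 6, 2, 4, 0)),    # 60 minutes
]
mtts = mean_time_to_sleep(pages)
print(mtts)
```

In practice the alert times would come from the monitoring system and the sleep times from whatever sleep data the engineer records; joining the two sources is the hard part and is not shown here.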

Regular reviews of on-call reports led to improvements in the experience, including a 30% decrease in the number of un-actionable alerts over the course of nearly a year (image: Alert Overview for Year-to-date). The reports identified problems that we could now act on and correct, including:

  • Repeat service outages, for which mitigation efforts could be prioritized and planned so that the outages no longer recur
  • Non-downtimed systems needlessly alerting engineers; updating our tooling to simplify setting downtime for hosts reduced these types of alerts by 50%
  • Alerts for which no action could be taken, which were therefore reconfigured to send only email rather than wake a sleeping engineer

While we can’t eliminate the anxiety an engineer may feel when going on call, we have surfaced several unknowns about the experience that are actionable. It is our hope that sharing what we’ve learned will help others gain visibility into their own on-call rotations and give them a better idea of what to expect.

Photo of Ryan Frantz

Ryan Frantz

Etsy

Ryan is a Senior Operations Engineer at Etsy. He loves to solve puzzles, play with his kids’ Legos, and learn new things. Like the harmonica.

Photo of Laurie Denness

Laurie Denness

Bloomberg LP

Operations all day and night. Previously at Last.fm, now at Etsy.com