It's 3AM, Do You Know Why You Got Paged?

Ryan Frantz (Etsy)
Operations
Location: 211
Level: Intermediate
Average rating: 4.52 (21 ratings)

You’re paged in the middle of the night. You wake up, confused and fumbling for your phone. You acknowledge the alert. Then you try to figure out exactly why you got paged in the first place, fighting the siren call of your pillow. Hopefully you’re lucid enough to begin addressing the issue. What if alerts automatically included more context about what was happening, especially information you would otherwise take time to gather? Computers can, and should, do as much work as possible for us before they page us. The graphs you look at, the commands you run, and the information they provide can all be incorporated into alerts to make them more meaningful and potentially decrease an alert’s Mean Time to Resolution.

In this presentation I’ll discuss the efforts Etsy’s Operations team put into contextualizing alerts to make them more useful. I’ll also review the nagios-herald project, a tool that integrates with Nagios to provide more valuable content in alert notifications. For example, starting with the canonical disk space check, we embedded 24-hour graphs in alerts to show whether disk growth was sudden or gradual. We’ve included Splunk query results that show the frequency of the alert, to determine whether it’s a recently recurring event or a one-off. We can also tell the on-call engineer why the alert triggered by highlighting where within a check’s thresholds the alert fell. Future alerts will include correlated information, such as business metrics that may see an impact when an alert fires, so that the on-call engineer can clearly communicate with affected stakeholders about alerting events.
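To make the threshold idea concrete, here is a minimal sketch in Python. It is not nagios-herald code: the function names, the percentage units, and the "higher is worse" comparison are all assumptions made for illustration. It shows how an alert body might mark which threshold a measured value crossed.

```python
def classify(value, warning, critical):
    """Return a Nagios-style state, assuming higher values are worse."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def threshold_summary(label, value, warning, critical):
    """Describe where a measured value sits relative to its thresholds.

    Hypothetical helper for illustration; the threshold matching the
    current state is flagged with '>>' so it stands out in the alert.
    """
    state = classify(value, warning, critical)
    lines = [f"{label}: {value}% used ({state})"]
    for name, limit in (("warning", warning), ("critical", critical)):
        marker = ">>" if state.lower() == name else "  "
        relation = "at or above" if value >= limit else "below"
        lines.append(f"{marker} {name} threshold {limit}%: {relation}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are made up: /var at 93% with 90%/95% thresholds.
    print(threshold_summary("/var", 93, 90, 95))
```

With these made-up values the value falls between the two thresholds, so the warning line is the one flagged. A reader at 3AM can see at a glance how close the disk is to critical.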

nagios-herald (currently under review to be open sourced) is extensible and provides a framework for adding temporal context to alerts via features such as:

  • Helpers – These are handy functions that perform queries against common tools (such as Ganglia, Graphite, and Splunk), download images, and more. The output generated by helpers is added to alerts to provide meaningful context around an event.
  • Formatters – These are used to format the content of the alert including colorizing text, inlining images, or attaching files. These aid in making content more legible (useful at 3AM!).
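The split between helpers and formatters can be sketched roughly as follows. This is an illustrative Python example, not the nagios-herald API: the function names, the host name, and the metric path are hypothetical. The only external convention it relies on is Graphite's render endpoint with its `target`, `from`, and `format` parameters. For brevity the formatter links to the graph instead of downloading and inlining it, as the tool described above does.

```python
from html import escape
from urllib.parse import urlencode

# Colors per alert state; the mapping itself is an assumption.
STATE_COLORS = {"OK": "green", "WARNING": "orange", "CRITICAL": "red"}


def graphite_graph_url(base_url, target, hours=24):
    """Helper: build a Graphite render URL for the last `hours` hours."""
    query = urlencode({"target": target, "from": f"-{hours}h", "format": "png"})
    return f"{base_url}/render?{query}"


def format_alert_html(host, service, state, graph_url):
    """Formatter: produce an HTML alert body with a colorized state and a graph."""
    color = STATE_COLORS.get(state, "gray")
    return (
        f"<h3>{escape(host)} / {escape(service)}</h3>"
        f'<p>State: <b style="color:{color}">{escape(state)}</b></p>'
        f'<img src="{escape(graph_url)}" alt="24-hour graph">'
    )


if __name__ == "__main__":
    # Hypothetical Graphite host and metric name.
    url = graphite_graph_url(
        "http://graphite.example.com", "servers.db01.disk.var.used_pct"
    )
    print(format_alert_html("db01", "Disk Space", "CRITICAL", url))
```

Keeping data gathering (the helper) separate from presentation (the formatter) means a new check can reuse existing helpers and only decide how to lay out their output.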

This presentation will provide several examples of in-production alerts before and after they were contextualized to illustrate how providing more relevant information in notifications can help identify problems. I’ll also demonstrate how nagios-herald has increased the usefulness of our alerts, share anecdotes from fellow engineers about how their on-call rotations improved as a result, and get you excited to be on-call again so that you too can make the experience more bearable.

Ryan Frantz

Etsy

Ryan is a Senior Operations Engineer at Etsy. He loves to solve puzzles, play with his kids’ Legos, and learn new things. Like the harmonica.