Signal Through the Noise: Best Practices for Alerting

David Josephsen (Librato)
Operations
Beekman
Average rating: 4.73 (15 ratings)

In recent years it’s become evident that alerting is one of the biggest challenges facing modern operations engineers. Conference talks, hallway tracks, and meetups are rife with discussions about poor signal-to-noise ratios in alerts, fatigue from false positives, and a general lack of actionability.

Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics and alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model, with best practices that your team can follow and implement regardless of your tool of choice.

This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, such as:

  1. Defining organizational policy about where and when to set alerts
  2. Ensuring the on-call engineer is armed with the proper information to take action
  3. Best practices for configuring an alert
  4. Fire-fighting after an alert has triggered
  5. Performing analysis across your organization-wide history of alerts

David Josephsen

Librato

David Josephsen is the “Developer Evangelist” at librato.com. His continuing mission: to help engineers worldwide close the feedback loop. He is also a sometime book-authoring blogger and purveyor of awkward conference talks.

He has never (not even once) used non-local goto. He speaks shell, Go, C, Python, Perl and a little bit of Spanish (in that order), and apologizes in advance for that thing he said (not the first thing, the other one. That first thing he totally meant to say and he refuses to redact it).