What Should I Monitor, And How Should I Do It?

Operations, Sutton South
Tutorial
Please note: to attend, your registration must include Tutorials on Monday.

Today’s monitoring tools largely offer two functionalities: health detection and time-series metrics. In my experience, most people try to use them without realizing that the tail is wagging the dog. As operations professionals, we should decide for ourselves what’s important to monitor, and then find or build tools that can do it for us. Unfortunately, available tools are basically all variations on the same square wheels we’ve been using for decades. We need clear requirements so we can build better tools.

What do we need? I argue that due to the fast-growing complexity and size of systems managed by slowly growing IT budgets and staff, we need tools that let us observe large numbers of servers and services with low mental impedance, detect faults reliably, surface relevant information, and assist in diagnosis.

How do our tools help us? Nagios is pretty good at alerting whether a system is alive or dead, but those kinds of failures are relatively rare. Most system faults are partial, transient, hard to observe, infinitely variable, and important to catch early while they’re small, before things turn ugly. Nagios isn’t good at this because it’s coarse-grained, threshold-driven, and you have to tell it what you think a problem is going to look like. In other words, Nagios can’t catch the vast majority of system faults you should care about. Most fault detection tools are in the same camp: alive-or-dead checks, and metrics-versus-thresholds.

Graphite represents the other big chunk of monitoring functionality commonly available: capturing, storing, and visualizing time-series metrics. This is primarily historical data, usually intended for diagnosis purposes, sometimes for forecasting and planning. As a diagnosis tool, though, a page full of graphs isn’t helpful. If you’re chasing a hard problem in a large system, you’ve got tens of thousands of metrics to look through. “Here, have some charts” just isn’t a good diagnosis tool.

So much for the status quo. What do I think we should do differently? Here’s a sample of the topics I’ll cover as I try to convince you we can and should do better:

  • We should measure the system’s work, not just its status. Work is the system’s raison d’être.
  • Generic, dumb tools aren’t enough; we need to know the meaning of the metrics.
  • Some metrics are of central importance. Everything else is for reference only. What are the core metrics?
  • What’s the difference between correlation and cause, and how can we determine it?
  • We need high resolution: one-second at a minimum. One-minute or five-minute is useless.
  • Fault detection should be based on whether work is getting done, again, in high resolution.
  • Graphs have no intrinsic meaning. Don’t stare at a graph and wonder what it means. That’s a backwards process.
  • Abnormality detection isn’t very useful at fine granularity, because systems are constantly abnormal.
  • End-user monitoring is great for detection, but not for diagnosis.
  • There are significant technical challenges to building capable tools, and open-source software currently leaves a lot of gaps that we need to fill.
  • Large-scale modeling and correlation, machine learning, AI, and so on have uses, but it isn’t an either-or choice. We can do a lot better than our crude tools today, without needing that kind of sophistication.
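
To make the points about resolution and work-based fault detection concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is invented, the rate of 100 requests per second and the five-second stall are arbitrary, and the `stalled_seconds` helper is hypothetical rather than part of any existing tool. It shows how a complete but brief stall almost vanishes in a one-minute average, while one-second samples of completed work expose it directly.

```python
from statistics import mean

# Hypothetical data: requests completed in each second of one minute.
# The service normally finishes 100 requests per second, but stalls
# completely for five seconds (seconds 30 through 34).
per_second = [100] * 60
for second in range(30, 35):
    per_second[second] = 0


def stalled_seconds(samples, min_work=1):
    """Return the indexes of the seconds in which fewer than min_work
    units of work were completed (hypothetical helper)."""
    return [i for i, done in enumerate(samples) if done < min_work]


# What a tool sampling once per minute would report for this minute.
one_minute_average = mean(per_second)

# What one-second samples of completed work reveal.
stalls = stalled_seconds(per_second)

print(f"one-minute average: {one_minute_average:.1f} requests/s")
print(f"stalled seconds: {stalls}")
```

With this invented data the one-minute average drops by less than ten percent, which a typical threshold would likely ignore, even though no work at all was done for five consecutive seconds.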

I believe that we don’t have good monitoring tools because we’ve been tackling the wrong problems. As Stephen Covey said, someone needs to climb the tallest tree and shout down “We’re in the wrong jungle!” Let’s start a conversation about what the right jungle is, and then spend some time sharpening our machetes, before we hack at the weeds.

Baron Schwartz


Baron is co-founder and CTO of VividCortex, a provider of SaaS database administration tools. He is the lead author of High Performance MySQL and continues to research and publish under the O’Reilly imprint. He has created several open-source software tools, including Maatkit, and has authored features for MySQL and InnoDB. He is an Oracle ACE, and the founder of the worldwide OpenSQL Camp conference series. He holds a degree in Computer Science from the University of Virginia. Baron lives in Charlottesville, Virginia with his family.

