Quantifying Abnormal Behavior

Baron Schwartz (VividCortex)
Operations, Mission City Ballroom B4
Average rating: 4.54 (13 ratings)

Monitoring systems that collect metrics usually support nice features like flexible graphing, and in some cases, even more advanced options such as trending and Holt-Winters Forecasting. But alerting is usually very primitive in comparison. Typical alerting systems do a static health check based on a preconfigured threshold, which we all know is never the right number — it’s just something that seems as reasonable as possible. The result: you get false positives when nothing’s wrong, and you don’t get alerts when something’s abnormal.

Why is this? The root cause is the primitive notion of healthy or sick. System health simply can’t be defined by a threshold (“85% CPU, oh noes!”). It needs to be based on three things: knowing how the system usually behaves, knowing how much the system is deviating from normal, and knowing whether the deviation is actually bad.

What if we could calculate normal behavior in real time? If your thoughts jump to a Hadoop cluster and some kind of impressive Big Data processing, think again. There’s a way to do this that’s as simple as a couple of basic arithmetic operations on each incoming metric, costing just a few CPU cycles! It involves simple stuff you probably already know.
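The abstract doesn't say which calculation the talk uses. As one hedged illustration of "a couple of basic arithmetic operations" per sample, here is a sketch that keeps an exponentially weighted moving average (EWMA) of a metric's mean and variance. The class name `EwmaBaseline` and the smoothing factor `alpha` are assumptions made for this example, not the speaker's method.

```python
import math


class EwmaBaseline:
    """Running estimate of a metric's usual level and spread.

    Hypothetical helper for illustration; alpha is an assumed smoothing
    factor (larger values forget old samples faster).
    """

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.mean = None  # no samples seen yet
        self.var = 0.0

    def update(self, x):
        """Fold in one sample and return its deviation score.

        The score is the sample's distance from the previous mean,
        measured in standard deviations. It is 0.0 until some spread
        has been observed.
        """
        if self.mean is None:
            self.mean = float(x)
            return 0.0
        diff = x - self.mean
        std = math.sqrt(self.var)
        score = abs(diff) / std if std > 0 else 0.0
        # Incremental EWMA update of mean and variance:
        # a handful of multiplications and additions per sample.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return score
```

Feeding each incoming value to `update()` returns a single number saying how far that value sits from recent behavior. Deciding whether a large score is actually bad is a separate question.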

Here’s what’s really surprising: when you can actually measure normality as a number (instead of tri-valued OK/WARN/CRIT logic), you can do all kinds of useful stuff with it. Imagine tracking normality as a metric itself, so you can quickly compare your system’s behavior to its historical performance, for example to see whether the system is behaving less consistently since the latest release was deployed. There’s a lot more you can do. This is really powerful juju.
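As a hedged sketch of that kind of comparison, suppose a deviation score has already been recorded for every sample. The function name `consistency_change`, the choice to summarize each window by its mean, and the score values are all assumptions for illustration.

```python
from statistics import mean


def consistency_change(scores_before, scores_after):
    """Ratio of the average deviation score after a release to the one before.

    A ratio above 1.0 suggests the system strays further from its own
    baseline than it used to; below 1.0 suggests it has become steadier.
    """
    before = mean(scores_before)
    after = mean(scores_after)
    if before == 0:
        raise ValueError("baseline window shows no deviation to compare against")
    return after / before


# Made-up scores, purely to show the call shape.
before_release = [0.4, 0.6, 0.5, 0.5]
after_release = [0.9, 1.1, 1.0, 1.0]
ratio = consistency_change(before_release, after_release)
```

With these invented numbers the ratio comes out at roughly 2, meaning the system deviates about twice as much from its baseline after the release as before.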

Here’s an outline of the topics we’ll cover.

  • Quick introduction, covering some of the above in a little more depth
  • Techniques from operations research, including Shewhart control charts and Brownian motion (run lengths)
  • Little’s Law and its relevance to “bad” versus “good” abnormalities
  • Neil J. Gunther’s Universal Scalability Law model and its relationship to Little’s Law
  • Basic forecasting techniques, including Holt-Winters Forecasting
  • Why control charts, Holt-Winters, trending, etc. are NOT the right approach
  • Metrics that matter: throughput, concurrency, response time, backlog, utilization
  • What to do with the riches you’ve just uncovered
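Two of the items above have compact standard forms, and both tie together the metrics the outline lists. Little's Law says mean concurrency equals throughput times mean response time. Gunther's Universal Scalability Law gives relative capacity at concurrency N as N / (1 + σ(N − 1) + κN(N − 1)). The sketch below just evaluates those formulas. The function names and sample numbers are assumptions for illustration, and nothing here shows how the talk itself applies them.

```python
def littles_law_concurrency(throughput, response_time):
    """Little's Law: mean concurrency = throughput * mean response time."""
    return throughput * response_time


def usl_relative_capacity(n, sigma, kappa):
    """Universal Scalability Law: relative capacity at concurrency n.

    sigma models contention (queueing for shared resources) and kappa
    models coherency delay (cost of keeping shared state consistent).
    With both set to zero, capacity scales linearly with n.
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


# Illustrative, made-up numbers: 200 requests/second at 0.05 seconds each.
in_flight = littles_law_concurrency(throughput=200, response_time=0.05)

# Illustrative, made-up coefficients for an 8-way concurrent workload.
capacity_at_8 = usl_relative_capacity(8, sigma=0.05, kappa=0.001)
```

In this example `in_flight` works out to about 10 requests in the system at once, and `capacity_at_8` falls short of the ideal 8 because of the contention and coherency terms.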

This presentation has a little bit of math, but you won’t need to think hard to understand it (trust me). I’ll show lots of pictures and explain everything with simple concepts like concurrency and how much work is queued up in a system. And I will share my slides.

Baron Schwartz


Baron is co-founder and CTO of VividCortex, a provider of SaaS database administration tools. He is the lead author of High Performance MySQL and continues to research and publish under the O’Reilly imprint. He has created several open-source software tools, including Maatkit, and has authored features for MySQL and InnoDB. He is an Oracle ACE, and the founder of the worldwide OpenSQL Camp conference series. He holds a degree in Computer Science from the University of Virginia. Baron lives in Charlottesville, Virginia with his family.

