Creating a scalable monitoring system that everyone will love

Molly Struve (DEV)

11:35–12:15 Wednesday, 6 November 2019

Location: Hall A6

Monitoring, Observability, and Performance

Average rating: 4.29 (7 ratings)

Who is this presentation for?

SREs, DevOps practitioners, developers, and anyone who deals with monitoring systems and is looking to improve monitoring capabilities and infrastructure

Level

Intermediate

Description

A year ago, the monitoring setup at Kenna Security was a disaster. Molly Struve walks you through what the company had in place to monitor its infrastructure, then dives into how the company fixed it.

Kenna Security used New Relic for performance monitoring, PagerDuty for application health monitoring, Elastalert for log-based alerts on data discrepancies or site-use anomalies, cron jobs that ran nightly or every 30 minutes to look for data anomalies, HoneyBadger for application and code errors, and admin dashboards for background processing services like Sidekiq and Resque. But the disaster didn’t end there. Not only did Kenna Security have six different tools doing the monitoring, it had them reporting to many different places: Slack channels (at its worst, every individual environment had its own Slack channel receiving alerts), SMS messages, email, and phone calls. As if all of those alerting channels weren’t enough to make your head spin, the alerts sent to them were incredibly inconsistent. Some alerts just reported data and required no action. Many went off periodically and were false positives. And, finally, some alerts actually needed someone to address them immediately.

Needless to say, those who were on call were miserable. They had no idea what was important or which alerts were actionable. This wasn’t a huge problem at first, because most of the team had been around for a while and knew all the ins and outs of which alerts were relevant. However, as the team started to grow, the company realized its monitoring system needed to change. Its newly minted SRE team quickly decided one of the first problems it was going to tackle was monitoring.

Over the course of a few months, Kenna Security overhauled the entire system, and the changes paid off in spades. Molly explains the company’s four big changes.

Consolidate monitoring to a single place: Everything has to be in one place. This is especially important the larger your team gets. As more and more people join, it’s harder to onboard them if you have to teach them multiple different systems. Instead, when someone goes on call, it’s infinitely easier to tell them to open up a single webpage and that’s it. Now, you can have multiple reporting tools, but you need to send all their alerts through a single interface.
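
As a rough illustration of sending alerts from several tools through one interface, here is a minimal Python sketch. It is not Kenna Security's implementation: the `Alert` fields, the source names, and the payload keys are all hypothetical, chosen only to show each tool's output being normalized into one shared shape and collected in one feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One shared shape for every alert, whatever tool produced it."""
    source: str
    name: str
    message: str


# Hypothetical payload shapes; real tools will use different keys.
def from_error_tracker(payload):
    return Alert("error_tracker", payload["error_class"], payload["message"])


def from_cron_check(payload):
    return Alert("cron_check", payload["check"], payload["details"])


NORMALIZERS = {
    "error_tracker": from_error_tracker,
    "cron_check": from_cron_check,
}


class AlertFeed:
    """The single place the on-call engineer looks at."""

    def __init__(self):
        self.alerts = []

    def ingest(self, source, payload):
        # Convert the tool-specific payload, then store it in the one feed.
        alert = NORMALIZERS[source](payload)
        self.alerts.append(alert)
        return alert
```

Adding another reporting tool under this sketch means adding one normalizer function; the feed that people read stays the same.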

Make all alerts actionable: The moment you let one piece of noise through, you set a precedent for everything else to be ignored. Once you start letting false positives be ignored, you can very quickly forget what’s important and what’s not. If an alert goes off and there’s no action to be taken, then that alert should not have gone off in the first place. If you want things to alert that are not actionable, you need to put them in a separate place far away from the actionable items.
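
One way to enforce this separation is to require every alert to name the action it expects and to route anything without one elsewhere. The sketch below is an assumption-laden example, not the approach from the talk: the `action` key and the `on_call` and `reports` destinations are hypothetical names.

```python
def route_alert(alert):
    """Return the destination for an alert dict.

    An alert counts as actionable only if it names a concrete action
    (for example, a runbook step). Everything else goes to a separate,
    non-paging destination.
    """
    action = (alert.get("action") or "").strip()
    return "on_call" if action else "reports"
```

For instance, `route_alert({"name": "queue_backlog", "action": "Restart the stuck worker"})` would go to the on-call destination, while an alert with no `action` would go to the reports destination.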

Make sure alerts can be muted: A lot of Kenna Security’s hand-rolled alerts in the beginning would trigger every 30, 60, or 90 minutes. Even if the team had acknowledged the alert and was working to fix it, it would still ping them. Nothing’s more frustrating than trying to fix a problem while an alarm is blaring in your ear. The single centralized system now gives Kenna Security the ability to mute alerts for however long it needs to fix the problem. Being able to mute alerts is not enough on its own; ideally, you want to be able to mute them for a specific timeframe. You don’t want to mute an alert, fix the problem, and then forget to unmute the alert afterwards.
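
Time-boxed muting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described in the talk: `MuteRegistry` and its method names are hypothetical, and a real monitoring tool would typically provide this feature itself.

```python
from datetime import datetime, timedelta, timezone


class MuteRegistry:
    """Tracks which alerts are muted and when each mute expires."""

    def __init__(self):
        self._muted_until = {}  # alert name -> expiry time (UTC)

    def mute(self, alert_name, duration, now=None):
        # Every mute carries an expiry, so nobody has to remember to unmute.
        now = now or datetime.now(timezone.utc)
        self._muted_until[alert_name] = now + duration

    def is_muted(self, alert_name, now=None):
        now = now or datetime.now(timezone.utc)
        expiry = self._muted_until.get(alert_name)
        if expiry is None:
            return False
        if now >= expiry:
            # The mute has lapsed; drop it so the alert can fire again.
            del self._muted_until[alert_name]
            return False
        return True
```

A notifier would call `is_muted(name)` before paging anyone. Because the mute expires by itself, the alert resumes without anyone having to switch it back on.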

Track alert history: This is one of those things you don’t think about until you are staring at an alert and have no idea what is causing it. Often, to figure out the cause of an alert, you need to know what the previous behavior was. If you have history for an alert, you can do this. By going back and looking for trends in the data, you can get a better picture of the situation, which can help when it comes to finding the root cause. Having alert history can also help you spot trends and find problems even before an alert is triggered. For example, let’s say you’re tracking database load. If you suddenly experience a large amount of growth, you can refer to your monitoring history for that alert to gauge what the load on the database is and whether you’re approaching the alert threshold. You can then use this information to get ahead of the alert before it even goes off.
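
The database-load example can be made concrete with a small Python sketch that projects a metric's history forward to estimate how soon it will reach its alert threshold. The function name and the straight-line projection are assumptions made for illustration; the talk does not prescribe a method, and real load rarely grows perfectly linearly.

```python
import math


def samples_until_threshold(history, threshold):
    """Estimate how many more samples until `threshold` is reached.

    Uses the average change per sample across `history` (oldest first).
    Returns 0 if the latest value is already at or above the threshold,
    and None if there is too little history or no upward trend.
    """
    if len(history) < 2:
        return None
    latest = history[-1]
    if latest >= threshold:
        return 0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    if slope <= 0:
        return None
    return math.ceil((threshold - latest) / slope)
```

With hourly load readings of 40, 45, 50, 55, and 60 percent and an alert threshold of 80 percent, the average growth is 5 points per hour, so the estimate is 4 more hours before the alert would fire.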

Overhauling this monitoring system has paid off in many ways. For starters, on-call developers are a lot happier. By removing any ambiguity around which alerts were important and which weren’t, Kenna Security took a lot of confusion out of being on call and removed a lot of noise. No one wants their phone buzzing all night long when they’re on call. Removing those false positives fixed this issue. Since all of the monitoring is now in a single place, it’s straightforward for developers to understand and learn. This ease of use has led a lot of developers to contribute to the alerting effort by making their own alerts and improving the ones already in place. Having a reliable, easy-to-use system gave developers a good reason to buy into it and join the effort to improve it.

Prerequisite knowledge

Experience with a monitoring system

What you'll learn

Identify basic strategies you can implement in your own monitoring systems to make them more scalable and user friendly for the engineers involved

Molly Struve

DEV

Molly Struve is the lead site reliability engineer at DEV. During her time in the software industry, she’s had the opportunity to work on some challenging problems. These include scaling Elasticsearch, sharding MySQL databases, and creating an infrastructure that can grow as fast as a booming business. When not making systems run faster, she can be found fulfilling her need for speed by riding and jumping her show horses.