Build resilient systems at scale
28–30 October 2015 • Amsterdam, The Netherlands

Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Sarah Wells (Financial Times)
13:45–14:25 Thursday, 29/10/2015
Location: Emerald Room
Average rating: 4.44 (39 ratings)

Prerequisite Knowledge

Attendees should be getting started with microservices, or thinking about doing so, and should understand what they are and why they can work. Some knowledge of the tools used for alerting and monitoring (e.g. Splunk, Nagios) would be helpful.

Description

Microservices can be a great way to work: the services are simple, you can use the right technology for the job, and deployments become smaller and less risky. Unfortunately, other things become more complex. You probably took some time to work out how you were going to quickly spin up, deploy, and run new services, for example through infrastructure and deployment automation.

But did the rest of your thinking about what “done” means catch up? Are you still setting up alerts, run books, and monitoring for each microservice as though it were a monolith?

Six months into building a new microservices architecture, we had 25 microservices, each in three environments, some with multiple datacentres, and we’d got to the point where an underlying network issue could mean 20 people each getting 10,000 alert emails overnight. With that volume, you can’t pick out the important stuff. In fact, your inbox is unusable, or you have everything filtered away where you’ll never see it.

Furthermore, you have information radiators all over the place, but there’s always something flashing or the wrong colour. You can spend the whole day moving from one attention-grabbing screen to another. So how do you get yourselves out of that mess and regain control of your inbox and your time?

First, you have to work out what’s important, and then you have to ruthlessly narrow your focus to that. You need to be able to see just the things you need to take action on, in a way that tells you exactly what you need to do. I’ll share how a team at the Financial Times did this, along with some tips and tricks.
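
As a rough illustration of the idea of seeing only what needs action, here is a minimal Python sketch. It is not the Financial Times implementation. The `Alert` fields, the rule "critical alerts in production only", and the `actionable_summaries` helper are all assumptions made up for this example. It collapses many per-service alerts that share one failing check into a single line that names the affected services and points at a run book.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are hypothetical, chosen only for illustration.
    service: str
    environment: str  # e.g. "prod", "test", "dev"
    check: str        # the failing check, e.g. "network-connectivity"
    severity: str     # "critical" or "warning"
    runbook: str      # where to find out what to do


def actionable_summaries(alerts):
    """Keep critical production alerts and collapse them to one line per check."""
    grouped = defaultdict(list)
    for alert in alerts:
        # Assumed rule: only critical production alerts need a human right now.
        if alert.environment == "prod" and alert.severity == "critical":
            grouped[alert.check].append(alert)

    summaries = []
    for check, group in sorted(grouped.items()):
        # Deduplicate services so repeated alerts do not inflate the count.
        services = sorted({a.service for a in group})
        summaries.append(
            f"{check}: {len(services)} service(s) affected "
            f"({', '.join(services)}); see {group[0].runbook}"
        )
    return summaries
```

With 25 services failing the same check in all three environments, this sketch yields one summary line rather than 75 separate notifications. The point is the shape of the filter, not the specific rule: what counts as actionable will differ from team to team.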

Sarah Wells

Financial Times

Sarah Wells is the technical director for operations and reliability at the Financial Times. Her teams build operational and developer tooling and help engineering teams at the FT to support the systems they build, including coordination, communication and learning around major incidents. Previously, Sarah was a developer and tech lead for nearly 20 years. Building a new microservices-based system about five years ago led her to develop a deep interest in operability, observability, and DevOps—and learn a lot about containerization, Kubernetes, and Go in the process.