Build resilient systems at scale
28–30 October 2015 • Amsterdam, The Netherlands

Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Sarah Wells (Financial Times)
13:45–14:25 Thursday, 29/10/2015
Location: Emerald Room
Average rating: 4.44 (39 ratings)
Slides: 1 PDF

Prerequisite Knowledge

Attendees should be getting started with microservices, or thinking about adopting them, and so should already understand what they are and why they can work. Some knowledge of the sorts of tools used for alerting and monitoring (e.g. Splunk, Nagios) would be helpful.


Microservices can be a great way to work: the services are simple, you can use the right technology for the job, and deployments become smaller and less risky. Unfortunately, other things become more complex. You probably took some time to work out how you were going to quickly spin up, deploy, and run new services: infrastructure and deployment automation, for example.

But did the rest of your thinking about what “done” means catch up? Are you still setting up alerts, run books, and monitoring for each microservice as though it were a monolith?

Six months into building a new microservices architecture, we had 25 microservices, each in three environments, some with multiple datacentres, and we’d got to the point where an underlying network issue could mean 20 people each getting 10,000 alert emails overnight. With that volume, you can’t pick out the important stuff. In fact, your inbox is unusable, or you have everything filtered away where you’ll never see it.

Furthermore, you have information radiators all over the place, but there’s always something flashing or the wrong colour. You can spend the whole day moving from one attention-grabbing screen to another. So how do you get yourselves out of that mess and regain control of your inbox and your time?

First, you have to work out what’s important, and then you have to ruthlessly narrow your focus to that. You need to be able to see just the things you need to take action on, in a way that tells you exactly what you need to do. I’ll share how a team at the Financial Times did this, along with some tips and tricks.

Sarah Wells

Financial Times

Sarah Wells is technical director for operations and reliability at the Financial Times. A developer with 15 years of experience, Sarah has led delivery teams across consultancy, financial services, and media. Over the last few years, she has developed a deep interest in operability, observability, and DevOps. Previously, she led work on the FT’s semantic publishing platform, which makes it easy to discover and access all of the FT’s published content via APIs in a common and flexible format. That project focused on Go, microservices, containerization, Kubernetes, and how to influence teams to do the right things.