As systems grow, they gain more components and more ways to fail. Alerts designed for an earlier version of the system can slowly “boil the frog”: suddenly no one has time to help the system scale further because everyone is constantly firefighting. Alert fatigue sets in, and the team burns out.
Jamie Wilkinson offers an overview of SLOs and the concept of the error budget, examines the motivation for moving from cause-based to symptom-based alerting, and demonstrates how to implement it in your own projects. By only paging when the SLO is not met or when the error budget is being burned at a predetermined rate, you can avoid alert fatigue and keep your team ready for action when it counts. You’ll learn about alerting on your SLOs and error budget, how the implementation changes as systems scale, and the tools you’ll need once the alerts themselves no longer tell you which part is broken.
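As a rough illustration of the paging rule described above (page only when the SLO is missed or the error budget is burning at a predetermined rate), here is a minimal Python sketch. It is not taken from the talk. The `Window` type, the function names, the 99.9% target, and the burn-rate threshold of 2 are all hypothetical choices made for the example. It assumes a request-based SLO, where the error budget is one minus the SLO target and the burn rate is the observed error ratio divided by that budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one time window (hypothetical type)."""
    total: int
    errors: int


def error_ratio(w: Window) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return w.errors / w.total if w.total else 0.0


def burn_rate(w: Window, slo_target: float) -> float:
    """Observed error ratio relative to the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio(w) / (1.0 - slo_target)


def should_page(slo_window: Window, recent: Window,
                slo_target: float, burn_threshold: float) -> bool:
    """Page if the SLO is missed over its full window, or if the recent
    burn rate has reached the predetermined threshold."""
    slo_missed = burn_rate(slo_window, slo_target) > 1.0
    burning_fast = burn_rate(recent, slo_target) >= burn_threshold
    return slo_missed or burning_fast


if __name__ == "__main__":
    # Assumed numbers: 99.9% target, page at twice the sustainable burn rate.
    month = Window(total=1_000_000, errors=500)
    last_hour = Window(total=10_000, errors=50)
    print(should_page(month, last_hour, slo_target=0.999, burn_threshold=2.0))
```

Note that the rule looks only at symptoms (failed requests), so a page of this kind says that users are affected, not which component caused it.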
Jamie Wilkinson is a site reliability engineer at Google. He is a contributing author to the SRE Book and has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SRECon. His interests began with monitoring and automation of small installations and continue with human factors in automation and systems maintenance on large systems. Despite his more than 15 years in the industry, he is still trying to automate himself out of a job.
©2018, O'Reilly Media, Inc.