As systems grow, they gain more components and more ways to fail. Alerts designed for an earlier, smaller version of the system can slowly “boil the frog,” until suddenly no one has time to help the system scale further because everyone is constantly firefighting. Alert fatigue sets in, and the team burns out.
Jamie Wilkinson offers an overview of SLOs and the concept of the error budget, examines the motivation for moving from cause-based to symptom-based alerting, and demonstrates how to implement it in your own projects. By paging only when the SLO is not met or when the error budget is being burned at a predetermined rate, you can avoid alert fatigue and keep your team ready for action when it counts. You’ll learn about alerting on your SLOs and error budget, how the implementation changes as systems scale, and the tools you’ll need once the alerts themselves no longer tell you which part is broken.
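To make the idea of paging on error budget burn concrete, here is a minimal Python sketch. It is not taken from the talk: the `Window` type, the function names, the 99.9% SLO target, and the paging threshold of 10 are all hypothetical values chosen for illustration. It assumes a request-based SLO, where the error budget is the fraction of requests allowed to fail and the burn rate is the observed error ratio divided by that budget.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one alerting window (hypothetical type)."""
    total: int
    errors: int


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; larger values mean it will run out early.
    """
    if window.total == 0:
        return 0.0  # no traffic, so nothing was burned in this window
    observed_error_ratio = window.errors / window.total
    return observed_error_ratio / error_budget(slo_target)


def should_page(window: Window, slo_target: float, threshold: float) -> bool:
    """Page only when the budget burns at or above the chosen rate."""
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    slo = 0.999          # assumed SLO target: 99.9% of requests succeed
    page_at = 10.0       # assumed threshold, picked only for this example
    quiet = Window(total=100_000, errors=500)
    noisy = Window(total=100_000, errors=2_000)
    print(should_page(quiet, slo, page_at))
    print(should_page(noisy, slo, page_at))
```

Note that this alert is purely symptom-based: it fires on what users experience (failed requests) and says nothing about which component caused the failures, which is why separate diagnostic tooling is needed once alerts work this way.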
Jamie Wilkinson has been a site reliability engineer at Google for over 11 years but is still trying to automate himself out of a job.
©2018, O'Reilly Media, Inc.