Monitoring is the foundational bedrock of site reliability yet is the bane of most sysadmins’ lives. Why? Monitoring sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools like Riemann and Prometheus have emerged to address this problem by scaling out monitoring configurations sublinearly with the size of the system.
In a talk complementing the Google SRE book chapter “Practical Alerting from Time Series Data,” Jamie Wilkinson explores the theory of alert design and time series-based alerting methods and offers practical examples in Prometheus that you can deploy in your environment today to reduce the amount of alert spam and help operators keep a healthy level of production hygiene.
Jamie Wilkinson has been a site reliability engineer at Google for over 11 years but is still trying to automate himself out of a job.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com