Training: June 20–21, 2016
Tutorials: June 21, 2016
Keynotes & Sessions: June 22–23, 2016
Santa Clara, CA

A practical guide to monitoring and alerting with time series at scale

Jamie Wilkinson (Google)
11:20am–12:00pm Wednesday, 06/22/2016
First time at Velocity Santa Clara, Measuring the right things
Location: Ballroom GH Level: Intermediate
Average rating: 3.27 (15 ratings)

Prerequisite knowledge

Attendees should have basic programming and arithmetic experience.

Description

Monitoring is the bedrock of site reliability, yet it is the bane of most sysadmins’ lives. Why? Monitoring sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools like Riemann and Prometheus have emerged to address this problem by letting monitoring configuration grow sublinearly with the size of the system.

In a talk complementing the Google SRE book chapter “Practical Alerting from Time Series Data,” Jamie Wilkinson explores the theory of alert design and time-series-based alerting methods and offers practical examples in Prometheus that you can deploy in your environment today to reduce alert spam and help operators maintain a healthy level of production hygiene.

Jamie Wilkinson

Google

Jamie Wilkinson is a site reliability engineer at Google. He is a contributing author to the SRE book and has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SREcon. His interests began with monitoring and automation of small installations and have since extended to human factors in automation and systems maintenance on large systems. Despite his more than 15 years in the industry, he is still trying to automate himself out of a job.