Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

Our many monitoring monsters

Megan Anctil (Slack)
4:35pm–5:15pm Thursday, June 22, 2017
Monitoring, Tracing, & Metrics
Location: LL20 A/B
Level: Intermediate
Average rating: 4.83 (6 ratings)

Who is this presentation for?

  • Operators and those building or working on monitoring solutions

Prerequisite knowledge

  • Basic experience with Icinga/Nagios, Graphite, Grafana, Elasticsearch, Logstash, and Kibana (useful but not required)

What you'll learn

  • What to consider when applying open source monitoring technologies to your stack
  • How Slack runs monitoring at scale

Description

One size definitely doesn’t fit all when it comes to open source monitoring solutions, and applying generally understood best practices to unique distributed systems presents all sorts of problems. Megan Anctil shares the pain points and lessons Slack learned while wrangling well-known technologies such as Icinga, Graphite, Grafana, and the Elastic Stack to fit the company’s use cases.

Slack uses a few well-known monitoring tools, but its Technical Operations team isn’t large enough to build an in-house solution for all of them. Nor does the team think it’s sustainable to throw money at the problem, given the volume of information processed and the not-insignificant price and rigidity of many vendor solutions. With thousands of servers across multiple regions and millions of metrics and documents processed and indexed per second, the team had to figure out how to scale these technologies to fit Slack’s needs.

On the backend, the team has experimented with multiple clusters in both Graphite and ELK, distributed Icinga nodes, and more. At the same time, they have tried to build usability into Grafana that reflects the team’s mental models of the system and have found ways to make alerts from Icinga more insightful and actionable. Megan explores the team’s experience and outlines a framework for building out your own special monitoring snowflakes.

Megan Anctil

Slack

Megan Anctil is a senior engineer on the Technical Operations team at Slack. She enjoys deep dives in debugging and long walks on the beach with her #MonitoringLove(s).
