Build & Maintain Complex Distributed Systems
June 11–12, 2018: Training
June 12–14, 2018: Tutorials & Conference
San Jose, CA

End-to-End Observability for Fun and Profit

Ben Hartshorne (Honeycomb), Christine Yen (Honeycomb)
9:00am–12:30pm Tuesday, June 12, 2018
Location: Room 114
Level: Intermediate
Secondary topics: Systems Monitoring & Orchestration

Prerequisite knowledge

Attendees will get the most out of this workshop if they have some production machines that they're currently responsible for keeping healthy. They should have a basic grasp of the wide range of factors that can cause an HTTP request from a client to a server to fail, and an understanding of how to interact with an API via their language/environment of choice.

Materials or downloads needed in advance

A laptop and familiarity with an API access tool of their choice. curl/jq/bash would be sufficient (but painful); a scripting language would be better. Anything that can issue HTTP requests and parse JSON responses should be sufficient. If they'd like to run the sample lightbulb server themselves, the GitHub repository will be public and the server will likely be written in Go, so a working Go environment would be nice but not necessary.

Description

What does uptime really mean for your system? An end-to-end (e2e) check is where the rubber hits the road for your user experience, and it is the operator’s best tool for measuring “uptime” as experienced by your users. Creating and evolving e2e checks also establishes a basis for defining the SLOs and SLIs that we are willing to support.

In this workshop, we’ll start off by talking about what goes into defining and running a good e2e check, then tell some stories of the lessons learned while writing e2e checks for services we’ve run in the past.

We’ll write one together against a common API we can all access (e.g., a small server driving a Philips Hue bulb at the front of the room) and use the simple lightbulb server as a touchpoint from which to gauge “correctness” of the system. (We’ll pause to write an e2e check together for the server, in whichever language/environment you prefer. Because of the nature of the server, a publicly available API that turns a lightbulb on and off, it’ll be a fun and interactive way to see the progress of your classmates.)
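As a rough illustration of what such a check might look like, here is a minimal Python sketch of a write-then-read round trip. The `/bulb` path, the `on` and `hue` fields, and the `run_check` and `http_json` names are all assumptions made up for this example; the workshop's actual lightbulb API may look quite different.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None, timeout=5.0):
    """Issue one HTTP request and return (status_code, parsed_json_body)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def run_check(base_url, send=http_json):
    """Write a value through the API, read it back, and report the outcome.

    The "/bulb" endpoint and its fields are hypothetical.
    """
    # The hue varies per run so that a stale read is unlikely to pass by accident.
    wanted = {"on": True, "hue": int(time.time()) % 65536}
    result = {"ok": False, "error": None, "duration_ms": None}
    start = time.monotonic()
    try:
        status, _ = send("PUT", base_url + "/bulb", wanted)
        if status != 200:
            raise RuntimeError("write returned HTTP %d" % status)
        status, seen = send("GET", base_url + "/bulb")
        if status != 200:
            raise RuntimeError("read returned HTTP %d" % status)
        if any(seen.get(key) != value for key, value in wanted.items()):
            raise RuntimeError("read back %r, wanted %r" % (seen, wanted))
        result["ok"] = True
    except Exception as exc:
        # Timeouts, connection errors, and bad JSON all count as check failures.
        result["error"] = str(exc)
    result["duration_ms"] = (time.monotonic() - start) * 1000.0
    return result
```

Calling `run_check` with the server's base URL would return a small dictionary describing the run. The `send` parameter exists only so the HTTP layer can be swapped out when testing the check itself.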

Once most folks have something that works, we’ll pause to talk about capturing, visualizing, and alerting on results: e.g. What’s useful to capture? What metadata should we have along the way? What existing paging alerts are obsoleted by an effective e2e check?
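One possible answer to the "what should we capture" question, sketched in Python: record one structured event per check run, with enough metadata to slice on later, and alert only on a run of consecutive failures. The field names, the shape of the `result` dictionary (`ok`, `duration_ms`, `error`), and the threshold of three are illustrative assumptions, not recommendations from the workshop.

```python
import json
import socket
import time


def build_event(result, check_name, target, now=None):
    """Wrap one check result with metadata for later querying."""
    return {
        "timestamp": time.time() if now is None else now,
        "check": check_name,
        "target": target,
        # Where the check ran from often matters as much as what it hit.
        "runner_host": socket.gethostname(),
        "ok": bool(result.get("ok")),
        "duration_ms": result.get("duration_ms"),
        "error": result.get("error"),
    }


def should_page(recent_events, threshold=3):
    """Return True when the newest `threshold` events all failed."""
    tail = recent_events[-threshold:]
    return len(tail) == threshold and not any(event["ok"] for event in tail)
```

Each event serializes cleanly with `json.dumps`, so it could be written as one line per run to a log or posted to whatever event store is in use.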

Then, stage two: we unveil a new, extended version of our lightbulb server, with multiple light bulbs representing a sharded backend. We’ll have a quick conversation about how the more complex backend changes the accuracy of our existing e2e checks and how they’ll have to evolve as a result.
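To make the accuracy problem concrete: a check that always uses the same key only ever exercises one shard, so the other shards can fail unnoticed. The Python sketch below picks one probe key per shard. It assumes, purely for illustration, that the backend routes a key to shard `crc32(key) % shard_count`; the real server's routing is not described here and may differ, and `shard_for` and `probe_keys` are hypothetical helper names.

```python
import zlib


def shard_for(key, shard_count):
    """Hypothetical routing rule: CRC32 of the key, modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def probe_keys(shard_count, prefix="e2e-probe-", max_tries=10000):
    """Find one key per shard so that a check run can touch every shard."""
    chosen = {}
    for i in range(max_tries):
        key = prefix + str(i)
        # Keep the first key found for each shard.
        chosen.setdefault(shard_for(key, shard_count), key)
        if len(chosen) == shard_count:
            return [chosen[shard] for shard in range(shard_count)]
    raise RuntimeError("could not cover every shard")
```

An e2e check could then loop over `probe_keys(n)` and run its round trip once per key, recording which shard each result belongs to.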

We’ll take another heads-down session for folks to update their e2e checks for the more complicated architecture, then wrap up with a discussion (if the audience is amenable) about similar real-world tradeoffs we’ve seen or had to work through.

As a backup plan, if folks either plow through the more complex e2e checks or are unwilling to engage, we’ll be prepared to do some live tweaking of the lightbulb server itself, let folks run their e2e checks, and see if we can understand how the system is failing based on the output of the e2e checks.

Ben Hartshorne

Honeycomb

For the last 12 years, Ben has found himself building monitoring, alerting, and observability systems for companies ranging from startuppy (Simply Hired and Parse) to top-10 (Wikimedia and Facebook). Strangely, he actually enjoys this work and is happy to finally be building a company that will help tease out nuances in data that seem to be missing from all the other crappy open source systems he’s used. Though unlikely to pass on a good scotch, he’ll reach for the bourbon or rye first.

Christine Yen

Honeycomb

Christine Yen is the cofounder of Honeycomb, a startup with a new approach to observability and debugging systems with data. Christine has built systems and products at companies large and small and likes to have her fingers in as many pies as possible. Previously, she built Parse’s analytics product (and leveraged Facebook’s data systems to expand it) and wrote software at a few now-defunct startups.
