What does uptime really mean for your system? An end-to-end (e2e) check is where the rubber meets the road for your user experience: it's the operator's best tool for measuring "uptime" as your users actually experience it. Creating and evolving e2e checks also establishes a basis for defining the SLOs and SLIs that we are willing to support.
In this workshop, we'll start off by talking about what goes into defining and running a good e2e check, then tell some stories of the lessons learned while writing e2e checks for services we've run in the past.
We'll write one together against a common API we can all access (a small server driving a Philips Hue bulb at the front of the room), using the simple lightbulb server as a touchpoint for gauging the "correctness" of the system. You can write your e2e check in whichever language or environment you prefer, and because the server is a publicly available API driving a lightbulb turning on and off, it'll be a fun and interactive way to see your classmates' progress.
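As a minimal sketch of what such a check might look like, here's a toggle-and-verify flow in Python. The `set_state`/`get_state` wrappers are injected callables standing in for the lightbulb server's HTTP API (whose actual endpoints aren't specified here), which also keeps the check logic testable without a live server:

```python
import time

def run_e2e_check(set_state, get_state, timeout=5.0, poll=0.5):
    """End-to-end check: flip the bulb and confirm the change is observed.

    set_state/get_state are assumed wrappers around the lightbulb API
    (e.g. a PUT and a GET against the server); they are illustrative,
    not the workshop's actual endpoints.
    """
    target = not get_state()          # flip whatever state we find
    start = time.monotonic()
    set_state(target)
    deadline = start + timeout
    while time.monotonic() < deadline:
        if get_state() == target:     # change became visible end to end
            return {"ok": True,
                    "latency_s": time.monotonic() - start,
                    "target": target}
        time.sleep(poll)
    # Never observed the change within the timeout: the check fails.
    return {"ok": False, "latency_s": timeout, "target": target}
```

The returned latency is itself a useful SLI: it measures how long a user-visible state change takes to propagate, not just whether the server answered.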
Once most folks have something that works, we’ll pause to talk about capturing, visualizing, and alerting on results: e.g. What’s useful to capture? What metadata should we have along the way? What existing paging alerts are obsoleted by an effective e2e check?
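As a sketch of the kind of per-run record those questions point at (the field names here are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass, field, asdict
import time

@dataclass
class CheckResult:
    """One e2e check run. All field names are illustrative, not canonical."""
    check_name: str   # which e2e check produced this result
    ok: bool          # did the check pass end to end?
    latency_s: float  # how long the user-visible operation took
    region: str       # where the probe ran from
    build: str        # version/build of the service under test
    ts: float = field(default_factory=time.time)  # when the run started

# A hypothetical run, flattened for shipping to a metrics or event store.
record = asdict(CheckResult("lightbulb-toggle", True, 0.42, "us-west", "abc123"))
```

Capturing metadata like region and build alongside the pass/fail bit is what lets you later ask "did this start failing with the new deploy?" instead of just "is it down?".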
Then, stage two: we unveil a new, extended version of our lightbulb server, with multiple light bulbs representing a sharded backend. We'll have a quick conversation about how the more complex backend changes the accuracy of our existing e2e checks and how they'll have to evolve as a result.
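One way a check might evolve for a sharded backend is to run the single-shard check against every shard and aggregate, so that a green result means every shard is healthy rather than whichever shard the probe happened to hit. The `shard_ids`, `check_shard`, and result shape below are illustrative assumptions:

```python
def run_sharded_check(shard_ids, check_shard):
    """Run a per-shard e2e check against every shard and aggregate.

    shard_ids: identifiers for each backend shard (here, each bulb).
    check_shard: callable taking a shard id and returning a dict with
    at least an "ok" bool. Both are hypothetical, for illustration.
    """
    results = {sid: check_shard(sid) for sid in shard_ids}
    failed = [sid for sid, r in results.items() if not r["ok"]]
    return {
        "ok": not failed,          # green only if every shard passed
        "failed_shards": failed,   # which shards need attention
        "results": results,        # per-shard detail for debugging
    }
```

Reporting which shards failed (not just an overall bit) is the design choice that keeps the check useful for debugging once the backend is no longer a single unit.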
We’ll take another heads-down session for folks to update their e2e checks for the more complicated architecture, then wrap up with a discussion (if the audience is amenable) about similar real-world tradeoffs we’ve seen or had to work through.
As a backup plan, if folks either plow through the more complex e2e checks quickly or are reluctant to engage, we'll be prepared to do some live tweaking of the lightbulb server itself, let folks run their e2e checks, and see whether we can understand how the system is failing based on the output of the e2e checks.
For the last 12 years, Ben has found himself building monitoring, alerting, and observability systems for companies ranging from startuppy (Simply Hired and Parse) to top-10 (Wikimedia and Facebook). Strangely, he actually enjoys this work and is happy to finally be building a company that will help tease out nuances in data that seem to be missing from all the other crappy open source systems he’s used. Though unlikely to pass on a good scotch, he’ll reach for the bourbon or rye first.
Christine Yen is the cofounder of Honeycomb, a startup with a new approach to observability and debugging systems with data. Christine has built systems and products at companies large and small and likes to have her fingers in as many pies as possible. Previously, she built Parse’s analytics product (and leveraged Facebook’s data systems to expand it) and wrote software at a few now-defunct startups.
©2018, O'Reilly Media, Inc.