The world of microservices and distributed systems is complex. There are now more systems to keep an eye on—and more ways they can go wrong. We need to be able to understand what these systems are doing, especially when things break. The traditional solution is logs: log everything, tune your log threshold just right, and away you go. But if you’re investigating an incident, you won’t find what you need if the log threshold wasn’t already set low enough before things went wrong. Just turning down the log threshold permanently isn’t the solution either. If you’re running a high-volume system, you’ll have a correspondingly high volume of logs being produced, which gets out of hand very quickly.
The value of logs lies in the questions you can answer with them: How busy is the system? How healthy is it? How is it performing for specific customers? But logs aren’t actually a good way of answering these sorts of questions. Logs are designed for humans to read, but our logs are no longer human scale; they are machine scale, so we need machines to help us make sense of them.
Sam Stokes explains why we need new, better tools and why those tools will also require us to design our systems to give them better data. What if instead of emitting logs for humans to read, we emitted events for machines to analyze? What would those events look like? What sort of hints might we give to the machine? What sort of questions could we ask?
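As a rough illustration of what an event for machines to analyze might look like, here is a minimal Python sketch. It is not taken from the talk or from any particular tool: the field names (`service`, `endpoint`, `customer_id`, `status`, `duration_ms`) and the helper functions are hypothetical, chosen only to show how one structured record per unit of work lets a program answer questions such as "how is the system performing for specific customers?"

```python
import json
import time


def build_event(service, endpoint, customer_id, status, duration_ms, timestamp=None):
    """Return one structured event describing a single unit of work.

    All field names here are illustrative assumptions, not a standard schema.
    """
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "service": service,
        "endpoint": endpoint,
        "customer_id": customer_id,
        "status": status,
        "duration_ms": duration_ms,
    }


def emit(event):
    """Serialize the event as a single JSON line, print it, and return it."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


def error_rate_by_customer(events):
    """Compute the fraction of events with status >= 500 for each customer."""
    totals, errors = {}, {}
    for event in events:
        cid = event["customer_id"]
        totals[cid] = totals.get(cid, 0) + 1
        if event["status"] >= 500:
            errors[cid] = errors.get(cid, 0) + 1
    return {cid: errors.get(cid, 0) / totals[cid] for cid in totals}
```

Because every event carries the same named fields, a question like "which customers are seeing errors?" becomes a simple aggregation over the events rather than a text search through log lines.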
Sam Stokes is a software engineer who can’t leave well enough alone. He’s compelled to fix broken things, whether they are software systems, engineering processes, or cultures. After watching too many systems catch fire, he’s building better smoke detectors at Honeycomb; in a past life, he cofounded Rapportive and built recommendation systems at LinkedIn.
©2018, O'Reilly Media, Inc.