Traditional approaches to debugging network issues across a globally distributed system are a pain, and when you’re responsible for an enormous amount of customer traffic, it’s important to tread lightly. If data collection is too frequent, at best your program takes up CPU reserved for serving traffic, and at worst you risk DDoS-ing your own servers. If it is too sparse (or targets only specific caches), determining the quickest path and diagnosing packet loss become largely speculative.
Conventional ping and traceroute tools are insufficient at Fastly’s scale, so the company had to build its own. Victoria Nguyen explains how Fastly overhauled the monitoring and data collection of its globally distributed network without its caches noticing. You’ll learn how the company uses hashing to balance data collection evenly between the caches within a site and collects data for each provider for best results.
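To make the hashing idea concrete, here is a minimal sketch of one way probe work could be spread across the caches in a site. The abstract does not say which hashing scheme Fastly uses; this example assumes rendezvous (highest-random-weight) hashing, and the cache names, target strings, and the functions `score` and `pickCache` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// score hashes a (cache, target) pair to a 64-bit value.
// The zero byte separates the two parts so that "ab"+"c" and "a"+"bc" differ.
func score(cache, target string) uint64 {
	sum := sha256.Sum256([]byte(cache + "\x00" + target))
	return binary.BigEndian.Uint64(sum[:8])
}

// pickCache returns the cache with the highest score for the given target,
// or "" if no caches are supplied. Every cache can compute this locally,
// so no coordination is needed to decide who probes what.
func pickCache(caches []string, target string) string {
	best, bestScore := "", uint64(0)
	for _, c := range caches {
		if s := score(c, target); best == "" || s > bestScore {
			best, bestScore = c, s
		}
	}
	return best
}

func main() {
	// Hypothetical caches within one site.
	caches := []string{"cache-1", "cache-2", "cache-3", "cache-4"}

	// Hypothetical probe targets: one per (provider, remote site) pair.
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		target := fmt.Sprintf("provider-%d/site-%d", i%5, i)
		counts[pickCache(caches, target)]++
	}
	for _, c := range caches {
		fmt.Printf("%s handles %d probe targets\n", c, counts[c])
	}
}
```

With this scheme, each target is owned by exactly one cache, and removing a cache reassigns only the targets that cache owned, which is one reason it might suit a setting where caches come and go.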
Refining the tools Fastly uses to pinpoint latency, packet loss, and quickest paths has been an iterative process. In the latest iteration, the company built on its own platform, leveraging caching, request routing in VCL, and HTTP to build more flexible monitoring and data collection tools. The system is written in Go, allowing HTTP requests to be lightweight and concurrent, and the API is wired to Slack bots so that anyone can ping or traceroute between sites without having to SSH into production servers or coordinate with each other during an incident.
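As a rough illustration of why Go suits lightweight, concurrent HTTP measurements, the sketch below times several HTTP requests in parallel using goroutines. It is not Fastly's tool: the `Result` type, `httpProbe`, `probeAll`, and the `/site-a` style paths are hypothetical, and a local test server stands in for remote sites.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Result holds the outcome of one probe.
type Result struct {
	Target string
	Millis float64
	Err    error
}

// httpProbe returns a probe function that times a single GET request.
func httpProbe(client *http.Client) func(string) (float64, error) {
	return func(url string) (float64, error) {
		start := time.Now()
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return float64(time.Since(start).Microseconds()) / 1000, nil
	}
}

// probeAll runs probe against every target concurrently and returns
// the results in the same order as targets.
func probeAll(targets []string, probe func(string) (float64, error)) []Result {
	results := make([]Result, len(targets))
	var wg sync.WaitGroup
	for i, t := range targets {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			ms, err := probe(t)
			results[i] = Result{Target: t, Millis: ms, Err: err} // each goroutine writes its own slot
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	// A local server stands in for remote sites in this sketch.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	targets := []string{srv.URL + "/site-a", srv.URL + "/site-b", srv.URL + "/site-c"}
	for _, r := range probeAll(targets, httpProbe(client)) {
		if r.Err != nil {
			fmt.Printf("%s: error: %v\n", r.Target, r.Err)
			continue
		}
		fmt.Printf("%s: %.2f ms\n", r.Target, r.Millis)
	}
}
```

Passing the probe as a function keeps the fan-out logic independent of HTTP, so the same pattern could wrap other kinds of measurement.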
Victoria Nguyen is a network systems engineer at Fastly. She loves rock climbing and Halloween.
©2018, O'Reilly Media, Inc.