Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

The holy grail of systems analysis: From what to where to why

Ben Sigelman (LightStep)
3:40pm–4:20pm Wednesday, June 21, 2017
DevOps & Tools
Location: LL21 A/B
Level: Intermediate
Average rating: 5.00 (5 ratings)

Who is this presentation for?

  • Engineering leaders, developers, and DevOps engineers with experience building and maintaining production systems

Prerequisite knowledge

  • A basic understanding of distributed systems

What you'll learn

  • Understand why latency problems in distributed systems are resource contention issues and how distributed tracing can help
  • Explore a demo showing how an instrumented application can be debugged using tracing and time series monitoring


Sudden latency regressions in distributed systems are almost always due to throughput-driven contention or queueing at some choke point. As such, the root cause of a slow transaction usually lies in the other transactions that are gumming up the works. How can we find the root cause of these interference effects explicitly and without guesswork? And how does that scale to microservice architectures, where each transaction can cross hundreds of process boundaries before completing its round trip?
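To make the queueing effect concrete, here is a minimal sketch of a single choke point modeled as a FIFO server with a fixed service time. The function name, the arrival schedules, and the 10 ms service time are illustrative assumptions, not material from the talk.

```python
def simulate_fifo(arrivals, service_time):
    """Return per-request latency (queue wait + service) for one FIFO server.

    Hypothetical model: `arrivals` is a sorted list of arrival times (ms),
    and every request needs exactly `service_time` ms of the shared resource.
    """
    latencies = []
    free_at = 0.0  # time at which the server next becomes idle
    for arrived in arrivals:
        start = max(arrived, free_at)  # wait if the server is still busy
        free_at = start + service_time
        latencies.append(free_at - arrived)
    return latencies


# Light load: one request every 20 ms against a 10 ms service time.
light = simulate_fifo([i * 20.0 for i in range(5)], service_time=10.0)

# Burst: one request every 5 ms against the same 10 ms service time.
burst = simulate_fifo([i * 5.0 for i in range(5)], service_time=10.0)

print(light)  # [10.0, 10.0, 10.0, 10.0, 10.0]
print(burst)  # [10.0, 15.0, 20.0, 25.0, 30.0]
```

Nothing about the individual requests changed between the two runs; only throughput did. The last request in the burst is three times slower purely because of the requests queued ahead of it.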

Solving this problem is the “holy grail” of systems analysis, and recent advances in distributed tracing technology bring it within reach of software engineering today. Ben Sigelman explains why this workflow could change the way we understand critical-path latency in distributed systems. Ben begins with a quick summary of the approach that Dapper, Google’s distributed tracing system, took in the mid-2000s, discussing the limits of its design and its fundamental inability to find the root cause of most contention-related latency issues. He contrasts this with the new world order, where some monitoring technologies can observe a distributed system with full fidelity, and then leads an audience-participation demo that connects the dots from a high-latency outlier request to the contended resource it’s waiting on. This workflow is direct and clear, and it replaces an entire bevy of other complex and expensive tooling.
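The step from an outlier request to the contended resource can be sketched roughly as follows. This is not Dapper's or LightStep's actual method. It is a toy illustration that assumes every request records when it asked for, acquired, and released each shared resource. The `LockSpan` record, the `blame` function, and all request and resource names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockSpan:
    """Hypothetical trace record: one request's use of one shared resource."""
    request_id: str
    resource: str
    requested_at: float  # when the request started waiting (ms)
    acquired_at: float   # when it obtained the resource (ms)
    released_at: float   # when it gave the resource back (ms)


def blame(spans, victim_id):
    """Find the resource the victim waited on longest and who held it meanwhile."""
    victim_spans = [s for s in spans if s.request_id == victim_id]
    if not victim_spans:
        return None, []
    # The victim's longest wait identifies the contended resource.
    worst = max(victim_spans, key=lambda s: s.acquired_at - s.requested_at)
    # Any other request whose hold interval overlaps the victim's wait
    # interval on that same resource contributed to the delay.
    holders = [
        s.request_id
        for s in spans
        if s.request_id != victim_id
        and s.resource == worst.resource
        and s.acquired_at < worst.acquired_at
        and s.released_at > worst.requested_at
    ]
    return worst.resource, holders


spans = [
    LockSpan("req-A", "db-conn-pool", 0.0, 0.0, 40.0),
    LockSpan("req-B", "db-conn-pool", 5.0, 40.0, 70.0),
    LockSpan("req-C", "db-conn-pool", 10.0, 70.0, 75.0),
    LockSpan("req-C", "cache-shard-7", 76.0, 77.0, 78.0),
    LockSpan("req-D", "cache-shard-7", 0.0, 0.0, 2.0),
]

resource, holders = blame(spans, "req-C")
print(resource, holders)  # db-conn-pool ['req-A', 'req-B']
```

The sketch only works if the spans of the interfering requests were recorded too. If req-A and req-B had been sampled away, the wait on the connection pool would still be visible but its cause would not, which is one way to read the abstract's emphasis on full-fidelity observation.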


Ben Sigelman


Ben Sigelman is the cofounder and CEO of LightStep, where he’s building reliability management for modern systems. An expert in distributed tracing, Ben is the coauthor of the OpenTracing standard, a project within the Linux Foundation’s Cloud Native Computing Foundation (CNCF). Previously, he built Dapper, Google’s production distributed systems tracing infrastructure, and Monarch, Google’s fleet-wide time series collection, storage, analysis, and alerting system. Ben holds a BSc in mathematics and computer science from Brown University.

Comments on this page are now closed.


06/21/2017 11:55am PDT

Hi Ben, thanks for the great talk today! Is there any reading material covering the solution you proposed for resource contention detection?