Training: June 20–21, 2016
Tutorials: June 21, 2016
Keynotes & Sessions: June 22–23, 2016
Santa Clara, CA

Debugging distributed systems

Donny Nadolny (PagerDuty)
3:40pm–4:20pm Thursday, 06/23/2016
DevOps
Location: Mission City Ballroom M1 - 2
Level: Intermediate

Despite our best efforts, our systems fail. Sometimes it’s our fault: code that we wrote, bugs that we caused. But sometimes the fault lies with systems that we have no direct control over. Distributed systems are complicated, hard to understand, and challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them.

ZooKeeper is a distributed coordination service that is often used as a building block for other distributed systems like Kafka and Spark. PagerDuty relies on it for many critical systems, and for five months it failed a lot. Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures. Donny covers tools for stress testing the network, some intricate details of how ZooKeeper works, and possibly more than you will want to know about TCP, including an example of machines having different views of the state of a TCP stream.

If you are interested in distributed systems and how they can fail, this session is for you.

Donny Nadolny

PagerDuty

Donny Nadolny is a Scala developer at PagerDuty who works on improving the reliability of its backend systems. He spends much of his time investigating problems with distributed systems like Cassandra and ZooKeeper.