It’s arguably the case that failure, in any form, is the most significant reason that distributed systems are difficult. Without the possibility of failure many traditional challenges in distributed computing, such as replication or leader election, become much simpler, much more performant, or both. But failure, of machines, processes and people, remains an unavoidable reality.
In this talk I want to demonstrate three things. First, why failure makes distributed systems design hard. Second, why understanding the root cause of a failure or outage is vital to operators of large distributed systems. Third, why doing that root cause analysis is itself difficult – because of the problems with understanding causality in distributed systems – and how we’ve had some success at Cloudera treating it as a big-data problem.
I’ll explain the role that failure plays in distributed systems design quickly, by showing how complex operations become trivially simple when the possibility of failure is removed.
I’ll motivate the problem of root-cause analysis by showing how bugs can be diagnosed after the fact, and repeat behaviour avoided, once we know what caused an incident, supported by anonymised examples that we have seen at Cloudera.
The key to understanding failures is knowing what event caused what – the causal relationship between incidents. Unfortunately, Hadoop and other systems do a poor job of sharing causal relationships, and doing so in general is fundamentally hard due to the lack of perfectly synchronised clocks.
In lieu of knowing the causal relationships between components, we have to try and infer them from correlations that we see between disparate signals, from log files to user actions to operating system monitoring. This data is readily available, but huge! The challenge is to have this data help us in forming hypotheses about causal links, which we can then validate. This can be cast as a big data analysis problem of searching for the most likely causal relationships between millions of seemingly independent events. I’ll show two ways we can attack this problem: by visualisation and by algorithm.
Finally I’ll show how the community can help this effort, by building tracing tools that make some causal relationships explicit, and therefore drastically cut down the amount of searching we have to do.
Henry Robinson is a software engineer at Cloudera, where he works on their management and monitoring infrastructure for Apache Hadoop. He has a background in distributed systems, and is also a PMC member for the Apache ZooKeeper distributed coordination platform.
For information on exhibition and sponsorship opportunities at the conference, contact Susan Stewart at firstname.lastname@example.org.
For information on trade opportunities with O'Reilly conferences contact Kathy Yu at mediapartners
For media-related inquiries, contact Maureen Jennings at email@example.com
View a complete list of Strata contacts