Thanks to the strides made in monitoring in recent years, it has become easier and easier to watch your production systems like a hawk. But your dashboards are often just peacocks: beautiful but not really useful. In critical situations, you need to get the systems up and flying again. The symptoms of underlying issues are all there, just waiting to be plucked. And the good news is that you’re not alone in figuring it out: ducks fly together… as long as they are coordinated.
Broad research across PagerDuty’s diverse customer base, combined with principles from public emergency response, has produced a framework for understanding operational incident response. How do you quickly diagnose the severity of an incident? When is it not an incident? How do workflows differ in whom to contact and when? What makes an effective Incident Commander? How does collaboration vary, and what is the role of ChatOps? In this session, we will answer questions like these in practical ways that will make a meaningful impact on how you manage incidents, from the duckiest to the fowl-est of them.
Dave (@Cliffehangers) is part of the product team at PagerDuty, which is responsible for making life for Dev and Ops engineers everywhere a calmer, sanity-filled reality. Outside of PagerDuty, Dave tries mightily to sleep, but his two kids thwart him at every turn.
Arup Chakrabarti has managed and built operations at Amazon and Netflix. He currently helps improve availability and reliability for his many customers as the operations engineering team lead at PagerDuty.