Fueling innovative software
July 15-18, 2019
Portland, OR

If only production incidents could speak

Subbu Allamaraju (Expedia Group)
11:00am-11:40am Thursday, July 18, 2019
Secondary topics: Cloud Native
Average rating: 4.55 (11 ratings)

Who is this presentation for?

  • SREs, architects, and systems engineers




Most enterprises undergoing large-scale transformations face a simple reality: the systems serving customers and the business are a mixture of rapidly changing microservices, legacy monoliths, shared databases, cloud native services, and legacy pets. There’s never enough time to get the house in order. Amid such change and heterogeneity, gaining the upper hand on resilience is far from simple. Contemporary slogans like “automate everything” or “inject faults to find faults” come with cost and risk, particularly when you have a mixture of old and new systems, some fast-changing and some slow-changing.

There’s no better way to understand failures than to examine trends and patterns from real-world incidents. An analysis of several hundred incidents shows that changes, config drift, and latent failures are more common causes than hardware and network failures. Subbu Allamaraju walks you through this analysis and the patterns it reveals, which hint at potential ways of getting better. In particular, you’ll look at practices for continuous improvement.

Most incidents are triggered by changes. Subbu explores the importance of risk safety, explains why it’s essential to mix safety with change, and provides some examples of how to do so. He also shares why traditional disaster recovery practices don’t work and why redundant systems are preferable to disaster recovery: analysis of some severe incidents shows that drift, interaction complexity, and loss of team memory contribute to incidents more frequently than natural disasters or data center mishaps. You’ll learn how tailored chaos engineering might make sense as a way to balance safety and validation. In particular, you’ll discuss three types of chaos tailoring practices (drift detection, incident validation, and redundancy validation) that might help continuous improvement while maintaining system safety. Join in to get comfortable with uncomfortable failures, using practices that increase team preparedness and help you understand the physics of complex systems.

Prerequisite knowledge

  • A basic understanding of distributed systems and production environments

What you'll learn

  • How to tailor contemporary best practices for your environment

Subbu Allamaraju

Expedia Group

Subbu Allamaraju is the vice president of technology at Expedia Group, where he leads a large-scale migration of Expedia’s travel platforms from enterprise data centers to a highly available architecture in the cloud. Subbu is a well-rounded engineer and influencer with hands-on experience in software development, architecture, distributed systems, services, internet protocols, operations, and the cloud. Previously, he helped build and empower several engineering and operations teams in these areas.