July 20–24, 2015
Portland, OR

Building a successful organization by mastering failure

John Goulah (Primary)
4:10pm–4:50pm Thursday, 07/23/2015
Collaboration Portland 252
Average rating: 3.36 (14 ratings)

Prerequisite Knowledge

Attendees should understand that people can and will make mistakes, but recognize that we can strive to catch those errors before they become critical. If they do become critical, we can use recovery techniques to stop or reduce the bad outcomes, and design remediation items to prevent them from occurring in the future. Attendees should keep an open mind: rather than blaming humans for failures, we learn from those failures.

Description

The Etsy organization has grown significantly over the last five years. As a company grows, more thought must be put into the communication techniques that it uses, and into how people acquire technical proficiency using them. Mastery manifests in a variety of ways, including understanding how a system fails and recovers, which patterns make it secure, what adds to or detracts from its maintainability, debuggability, or performance, which practices best spread knowledge, and how we learn from failures.

At the organizational level we tend to achieve this mastery primarily through tooling, though also through process and education. This talk will cover several communication techniques that have helped foster a Just Culture, one in which an effort is made to balance safety and accountability.

The first techniques we will cover are architecture and operability reviews. Each has a distinct purpose. An architecture review aims to understand the costs and benefits of a proposed solution, and to discuss alternatives. These exploratory conversations generally focus on technical departures, so that we gain confidence in new and different systems that may be introduced. An operability review ensures that we know when a system is working, and how we will know when it is broken. The talk will cover the formats we use for both meetings and the questions that can be posed in them.

The second technique is about how we deal with failure. Anyone who has worked with technology at scale is familiar with failure. Even with all the planning and thought that goes into architecture and operability reviews, we still encounter it. By investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism, and on the decision-making process of individuals proximate to the failure, an organization can come out safer than it would have if it had simply punished the actors involved as a remediation. This is what we call the Blameless PostMortem, and we’ll dive into how to structure and approach these meetings.

As your organization grows, more thought has to be put into how people communicate. If you start to implement some of these techniques early on, they can assist with technology changes over the years, and introduce processes that deal with failure in mature ways.

John Goulah

John Goulah works in New York City and has over a decade of experience scaling infrastructure for media- and e-commerce-based platforms. He seeks out non-mundane tasks and has automated himself out of his last few endeavors, which has landed him in his current role as a senior engineering manager at Etsy, the leading marketplace for handmade goods. He has been working there for almost five years on developer tools and deployment infrastructure.