Build & maintain complex distributed systems
October 1–2, 2017: Training
October 2–4, 2017: Tutorials & Conference
New York, NY

Persistent SRE anti-patterns: Pitfalls on the road to creating a successful SRE program like Netflix and Google

Blake Bisset (Independent), Jonah Horowitz (Stripe)
4:45pm–5:25pm Tuesday, October 3, 2017
DevOps & Tools
Location: Gramercy
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Service delivery professionals

Prerequisite knowledge

  • Experience running significant service delivery efforts

Description

What isn’t site reliability engineering? Does your NOC escalate outages to your DevOps engineer, who in turn calls your packaging and deployment team? Did your Chef just sprinkle some Salt on your Ansible Red Hat and call it SRE? Lots of companies claim to have SRE teams, but some don’t quite understand the full value proposition, or which shiny technologies and organizational structures will hurt their operations rather than empower their teams to accomplish their mission.

Blake Bisset and Jonah Horowitz share stories about anti-patterns in monitoring, incident response, configuration management, and more that they’ve tripped over on their own teams, seen proposed as good practice in talks at other conferences, or heard in conversations with peers in the industry. Blake and Jonah also explain how Google and Netflix view the role of the SRE (and how it differs from the traditional system administrator role). You’ll learn that freedom and responsibility are key, trust is required, and chaos is (sometimes) your friend.

Blake Bisset

Independent

Blake Bisset got his first legal tech job at 16. He won’t say how long ago, except that he’s legitimately entitled to make shaky fists while shouting, “Get off my LAN!” He’s cofounded three startups (a joint venture with DuPont/ConAgra, a biotech spinoff from UW, and one that started this time a bunch of kids were sitting around on New Year’s Eve, wondering why they couldn’t watch movies on the internet), only to end up spending a half-decade as an SRM at YouTube and Chrome, where his happiest accomplishment was holding the go/bestpostmortem link for several years.

Jonah Horowitz

Stripe

Jonah Horowitz is a site reliability engineer at Stripe, where he works with all of the company’s individual engineering teams to drive reliability efforts, including monitoring, alerting, deployment pipelines, and chaos resiliency. Previously, Jonah worked at several startups around the Bay Area, including Netflix, Quantcast (a leading ad-tech startup, where he grew the company’s network to process over three million events per second), and Looksmart (a contextual advertising company), and was on the founding team of Walmart.com (now @WalmartLabs), where he built out the company’s software deployment pipelines and its product image management systems.