Google’s customer reliability engineering team is a specialized group of SREs who go into the world and teach enterprise customers of public cloud infrastructure—via their actual production systems—how to “do SRE” in their orgs. In the team’s two years of existence, its members have found that some things they thought would be hard weren’t, while others were nigh on impossible. The team has written many postmortems and learned a bunch of lessons you can only learn the hard way. Liz Fong-Jones and Dave Rensin share eight of these key lessons.
Liz Fong-Jones is a staff site reliability engineer on the Google Cloud customer reliability engineering team at Google in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.
Dave Rensin is the director of customer reliability engineering (CRE) at Google. His team takes Google SREs who work on the reliability and availability of internal Google systems and redirects them to the reliability and availability of customer production systems running on Google Cloud. His mission is to teach Google customers how to design, build, and run highly available systems using Google SRE practices and tools. Dave is the author of several books, including two for O’Reilly, and holds more than a dozen patents in distributed systems, data acquisition, access control, and pattern matching.
©2018, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.