Chaos engineering: When the network breaks (sponsored by Gremlin)

Ho Ming Li (Gremlin)

11:35–12:15 Thursday, 7 November 2019

Location: M1

Who is this presentation for?

Site reliability engineers, network engineers, system admins, and network admins

Level

Beginner

Description

Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally break things on purpose to learn how to build more resilient systems.

Ho-Ming Li leads a walk-through of network chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Even if you’re already using chaos engineering, you’ll identify new ways to use it to improve the resilience of your network and services. You’ll also discover how other companies are using chaos engineering and the positive results they have seen in building reliable distributed systems.

Ho-Ming Li explains chaos engineering and its principles, why many engineering teams (including Netflix, Gremlin, Dropbox, National Australia Bank, Under Armour, and Twilio) use it, and how every engineering team can use it to create reliable systems. You’ll learn how to get started with chaos engineering on your own team as you explore the tools to measure success, dedicated chaos tools, and the new chaos features built into cloud services. You’ll also discover how to use war-game environments to learn about chaos engineering and how to practice it on Amazon DocumentDB, Amazon DynamoDB, Amazon RDS, and Amazon S3. Finally, you’ll be introduced to combining monitoring tools with chaos engineering to help create reliable distributed systems, and you’ll find out where you can learn more and how to join the chaos community.

This session is sponsored by Gremlin.

Prerequisite knowledge

A basic understanding of production environments and the infrastructure required to run systems
Experience with Linux, cloud infrastructure, hardware, networking, and systems troubleshooting
Familiarity with chaos engineering (read “Rx onError Guidelines” for an overview)

What you'll learn

Determine how and when your network breaks
Understand how network chaos engineering attacks can be used to improve the resiliency of your cloud infrastructure
Identify different types of network chaos engineering attacks, including packet loss, packet corruption, latency, and black hole
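
To make the four attack types above concrete, here is a minimal sketch that maps each one to a Linux `tc netem` command. It is an illustration only, not how Gremlin implements its attacks: the helper names, the `eth0` default interface, and the percentages and delay are all assumptions chosen for the example, and the black hole is approximated here as 100% packet loss. The code only builds the command lists; actually running them would require root privileges on a disposable test host.

```python
# Hypothetical mapping from attack name to netem arguments.
# The values (10%, 5%, 100ms) are illustrative assumptions, not recommendations.
NETEM_ARGS = {
    "packet_loss": ["loss", "10%"],          # randomly drop a share of packets
    "packet_corruption": ["corrupt", "5%"],  # randomly flip bits in a share of packets
    "latency": ["delay", "100ms"],           # add a fixed delay to every packet
    "black_hole": ["loss", "100%"],          # approximate a black hole by dropping everything
}


def netem_command(attack, interface="eth0"):
    """Return the tc command (as an argument list) that would start the attack."""
    if attack not in NETEM_ARGS:
        raise ValueError(f"unknown attack type: {attack}")
    return ["tc", "qdisc", "add", "dev", interface, "root", "netem", *NETEM_ARGS[attack]]


def rollback_command(interface="eth0"):
    """Return the tc command that would remove the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for attack in NETEM_ARGS:
        print(attack, "->", " ".join(netem_command(attack)))
    print("rollback ->", " ".join(rollback_command()))
```

Keeping a rollback command next to every attack reflects the usual practice of making an experiment easy to stop as soon as the system behaves unexpectedly.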

Ho Ming Li

Gremlin

Ho-Ming Li is the lead solutions architect at Gremlin. Previously, he worked at Amazon Web Services, where he advised many customers on architectural and operational best practices. He takes a strategic approach to delivering holistic solutions, often diving into the intersection of people, process, business, and technology. His goal is to enable everyone to build more resilient software by means of chaos engineering practices.