Build resilient systems at scale
May 27–29, 2015 • Santa Clara, CA

Building self-healing systems

Todd Minnella (conDati, Inc.), Matt Solnit (SOASTA, Inc.)
1:30pm–3:00pm Wednesday, 05/27/2015
Location: Mission City M1-2
Average rating: 2.11 (19 ratings)

Prerequisite Knowledge

We recommend that attendees be comfortable with the Linux command line and have some familiarity with the programmatic use of service APIs.

Materials or downloads needed in advance

Attendees may want to bring:

  • a mobile phone or laptop with a web browser
  • a device for taking down notes and ideas


In this tutorial we will walk attendees through setting up three examples of self-healing behaviors using a typical front-end Linux system running Java. Our examples will illustrate what is possible using current tools and APIs.

This session is targeted at entry- to mid-level technology professionals working in a DevOps role. We will implement each technique on a demonstration system and show it working in real time.

As part of this tutorial, we will cover the following:

  • Challenges faced by distributed systems
  • Infrastructure characteristics that can mitigate failure impact
  • Benefits and risks of self-healing behaviors
  • Testing requirements
  • Ideas for automation opportunities

We’ll cover three examples of self-healing behaviors that, together, represent an arc of capabilities that can help a small operations staff run and manage a large, reliable distributed system. Each of the following automation examples is triggered by a monitor or metric crossing an appropriate threshold:

  • An external script triggering a full garbage collection of a Java process
  • Removing a system from service (using a DNS API), rebooting it, validating it, and returning it to service
  • Identifying a problem that a restart did not fix, collecting log files, and logging a defect
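The arc described in the three examples above can be sketched as an escalation ladder: check a metric against a threshold, try the mildest remedy first, re-check, and log a defect only if nothing helped. The following is a minimal illustration under stated assumptions, not the presenters' implementation. The `Healer` class, its field names, and the simulated heap metric are all hypothetical. The only real tool referenced is the JDK's `jcmd <pid> GC.run`, which asks a running JVM to perform a garbage collection. The DNS removal, reboot, validation, log collection, and defect filing steps are left as injectable callables, because they depend on site-specific APIs.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def full_gc_command(pid: int) -> List[str]:
    """Build the JDK `jcmd` invocation that requests a GC in a running JVM."""
    return ["jcmd", str(pid), "GC.run"]


def trigger_full_gc(pid: int) -> bool:
    """Run jcmd against the JVM; returns True if the command exited cleanly."""
    result = subprocess.run(full_gc_command(pid), capture_output=True, text=True)
    return result.returncode == 0


@dataclass
class Healer:
    """Hypothetical escalation ladder; every name here is illustrative."""
    read_metric: Callable[[], float]                 # e.g. heap usage in percent
    threshold: float                                 # unhealthy at or above this
    remedies: List[Tuple[str, Callable[[], None]]]   # ordered mildest first
    escalate: Callable[[], None]                     # e.g. collect logs, file defect
    log: List[str] = field(default_factory=list)

    def healthy(self) -> bool:
        return self.read_metric() < self.threshold

    def run_once(self) -> str:
        """Apply remedies in order until the metric recovers, else escalate."""
        if self.healthy():
            return "healthy"
        for name, remedy in self.remedies:
            remedy()
            self.log.append(name)
            if self.healthy():
                return "healed by " + name
        self.escalate()
        self.log.append("escalate")
        return "escalated"


if __name__ == "__main__":
    # Simulated system: the GC stand-in drops heap usage below the threshold.
    heap = {"used_pct": 95.0}

    def simulated_gc() -> None:
        heap["used_pct"] = 60.0

    def simulated_reboot() -> None:
        # Placeholder for: remove from DNS, reboot, validate, return to service.
        heap["used_pct"] = 20.0

    healer = Healer(
        read_metric=lambda: heap["used_pct"],
        threshold=90.0,
        remedies=[("full_gc", simulated_gc), ("reboot", simulated_reboot)],
        escalate=lambda: None,
    )
    print(healer.run_once())
```

In a real deployment, the `full_gc` remedy could wrap `trigger_full_gc` with the PID of the Java process. Passing the actions in as callables keeps the ladder itself testable without touching a live system, which matters given the testing requirements listed earlier in the session outline.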

Following our examples and demonstrations, we will discuss our real-world experiences with these and similar techniques. Attendees at this tutorial should walk away with a better understanding of what the tools of today make possible.

We hope to instill a can-do attitude: these techniques are practical to implement and can help make your operations team look like efficient superheroes!

Todd Minnella

conDati, Inc.

Todd is a web operations professional with experience in designing, building, and managing IT infrastructure for both central services and externally facing applications. He’s been helping people be more productive with their computers and IT systems for over 25 years (his first gig was supporting Apple IIes and IIcs at a computer summer school), and his favorite operating system is Tru64.

Todd is currently leading a team responsible for the SaaS operations of SOASTA. As part of his job, he enjoys creating and supporting systems that are so stable that he can sleep without early-morning chirps from his phone.

Matt Solnit


Matt Solnit is VP of engineering for server-side infrastructure at SOASTA, the leader in web and mobile performance analytics.