Application Resilience Engineering and Operations at Netflix

Ben Christensen (Netflix)
Operations, Mission City Ballroom B4

Distributed applications are complex systems full of latent failures (bugs), latency, and ever-changing behavior in the relationships between components. Systems easily “drift” from a state of resilience, and failure can emerge from component relationships. Thus, applications (as components of a complex system) must be resilient to latency and failure in all of their system relationships and must not rely upon infrastructure alone to implement this resilience.

Common resilience patterns used by Netflix in production will be shared, including:

  • Bulkhead isolation using threads and semaphores
  • Circuit breaker
  • Fail Fast
  • Fail Silent
  • Static Fallback
  • Stubbed Fallback
  • Fallback via Network Cache
  • Primary + Secondary with Fallback
  • Get-Set-Get with Request Cache Invalidation
  • Sharded Backend
  • Request Caching
  • Request Collapsing
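As a rough illustration of how three of these patterns can combine (semaphore-based bulkhead isolation, fail fast, and a static fallback), here is a minimal sketch in plain Java. The class name `SemaphoreBulkhead` and its methods are hypothetical and are not taken from Hystrix or from the talk.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix: a semaphore bulkhead that
// rejects excess callers immediately and returns a fallback on failure.
final class SemaphoreBulkhead<T> {
    private final Semaphore permits;
    private final Supplier<T> fallback;

    SemaphoreBulkhead(int maxConcurrent, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.fallback = fallback;
    }

    T execute(Supplier<T> call) {
        // Fail fast: do not queue behind a saturated dependency.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            // The failure stays inside the bulkhead; the caller sees the fallback.
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

A caller might wrap a dependency as `new SemaphoreBulkhead<>(10, () -> "default")` and route every request to that dependency through `execute`. Because the call runs on the caller's own thread, a semaphore limits concurrency but cannot interrupt a call that is merely slow; guarding against latency would need a timeout on a separate thread, which this sketch leaves out.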

With these common patterns we can achieve resilience to failing system relationships, but systems are complex and always changing, so operating and maintaining a resilient system includes finding weaknesses and managing drift. Operating such systems at Netflix with resilience patterns over the past 18 months has shown that implementing them in code is only half the battle: knowing how to deploy, configure, operate and maintain resilience requires a different body of knowledge.

Examples of techniques to be shared include:

  • latency injection in production to reveal weaknesses
  • inspection of system network connections to find requests being made without isolation patterns protecting them
  • alerts and metrics (realtime and historical) showing changes in dependency behavior, such as new usage patterns or shifts in resource utilization, that can require reconfiguring bulkheads, isolation barriers, circuits, etc.
  • when not to change configuration … such as in the middle of an operational event
  • system jitter and what to expect when looking closely at the millisecond level of an application instead of multi-minute averages
  • percentile distributions
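To see why percentile distributions can tell a different story than averages, consider the small sketch below. It computes a nearest-rank percentile and a mean over latency samples; the class name `LatencyStats` and the sample values are made up for illustration and do not come from Netflix data.

```java
import java.util.Arrays;

// Hypothetical helper for comparing an average with percentiles.
final class LatencyStats {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Made-up samples: 98 calls at 10 ms and 2 calls at 2000 ms.
        long[] samples = new long[100];
        Arrays.fill(samples, 10);
        samples[98] = 2000;
        samples[99] = 2000;
        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```

With these invented samples the mean is 49.8 ms, the median is 10 ms, and the 99th percentile is 2000 ms. The average describes no request that actually happened, while the percentiles show both the typical case and the slow tail.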

Concrete examples, metrics, use cases and code will be shown including the use of Hystrix, a library implementing many of these patterns.
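The circuit breaker pattern from the list above can be illustrated without any library. The sketch below is not Hystrix and does not use its API; the class `SimpleCircuitBreaker`, its parameters, and the rule of tripping after a number of consecutive failures are all assumptions chosen to keep the example short.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker (not the Hystrix implementation).
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the clock can be faked
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit: the dependency is not called
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        // Simplification: once the wait has elapsed, every caller is let
        // through as a trial until one of them succeeds or fails.
        return clock.getAsLong() - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the wait
        }
    }
}
```

A caller could construct it as `new SimpleCircuitBreaker(5, 5000, System::currentTimeMillis)` and pass each dependency call together with a fallback. The state checks are synchronized, but the action itself runs outside the lock so that a slow dependency does not serialize callers.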

Ben Christensen

Netflix

Ben Christensen is a software engineer on the Netflix API Platform team responsible for fault tolerance, performance, architecture and scale while enabling millions of customers to access the Netflix experience across more than 800 different device types. Specializing in Java since the 90s, and through years of web and server-side development, Ben has gained particular interest and skill in building maintainable, performant, high-volume, high-impact systems. Prior to Netflix, Ben was at Apple in the iTunes division making iOS apps and media available to the world.

Comments

Ben Christensen
06/18/2013 3:39am PDT

Malini, that particular use case is not something I’m addressing in my presentation as I’m focusing on server-side applications. I’d be happy to discuss with you in person though. Chat with me on Twitter @benjchristensen to coordinate.

Malini Kothapalli
06/17/2013 4:40am PDT

I am interested in learning how Netflix manages to fail over streaming sessions that are in progress. Could that topic be discussed at this session?
