All Software Architecture, All the Time
June 10-13, 2019
San Jose, CA

Use of self-healing techniques and failure injections to build a reliable service at Adobe (Velocity)

2:20pm–3:05pm Wednesday, June 12, 2019
Overcoming Obstacles: Lessons in Resilience
Location: Expo Hall Sessions

Who is this presentation for?

  • SREs, DevOps, system engineers, and CTOs
Level

Beginner

What you'll learn

  • See a real-world use case that applies self-healing and chaos engineering to design resilient services

Description

The advertising industry faces numerous challenges in targeting a given audience dynamically and accurately to deliver a meaningful brand message. The need for near-real-time, low-latency delivery of dynamic content, the sheer volume of information processed, and the sparse geographic distribution of the intended audience all add to the complexity of building a successful experience for the end user and the brand. Additionally, the competitiveness of the industry makes it critical to keep operational expenses low while delivering reliably at scale.

To address these challenges, Nicolas Brousse and Oleksii Mykhailov found that a distributed infrastructure combining public cloud providers with a private cloud built on open infrastructure technologies can deliver dynamic advertising content with low latency while remaining highly available. But network and physical utility infrastructure alone can’t be relied upon to ensure service dependability. Nicolas and Oleksii show that the complexity of the networks, the sparse geographic distribution of the audience, the risk of data center failures, and the increase in encrypted transactions call for thoughtful architectures. Introducing modern practices, failure injection, and self-healing mechanisms allowed them to improve the service’s fault tolerance while optimizing for latency, significantly improving service reliability.

Nicolas and Oleksii present the results from their industry paper “Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service,” which won the Best Disruptive Idea Award at the 29th IEEE International Symposium on Software Reliability Engineering. They also give a live demo in which a failure injection shuts down a full data center, showing the traffic shift away and then recover.
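
For readers new to the idea, the traffic shift described for the demo can be illustrated with a toy model. This is only a minimal sketch, not the speakers' implementation: the site names, latency figures, and the `route`, `inject_failure`, and `heal` helpers are all hypothetical, and a real service would rely on health checks and DNS or load-balancer failover rather than an in-memory flag.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str            # hypothetical site name
    latency_ms: float    # hypothetical client-to-site latency
    healthy: bool = True


def route(sites):
    """Return the lowest-latency healthy site, or None if all are down."""
    candidates = [s for s in sites if s.healthy]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


def inject_failure(site):
    """Failure injection: simulate losing a whole data center."""
    site.healthy = False


def heal(site):
    """Stand-in for a self-healing action that brings the site back."""
    site.healthy = True


sites = [
    DataCenter("dc-west", 12.0),
    DataCenter("dc-east", 48.0),
    DataCenter("cloud-eu", 95.0),
]

before = route(sites).name   # closest site serves traffic
inject_failure(sites[0])     # shut down the closest site
during = route(sites).name   # traffic shifts to the next-best site
heal(sites[0])               # the failed site recovers
after = route(sites).name    # traffic returns to the closest site
print(before, during, after)
```

In this toy model, requests go to the lowest-latency healthy site, move to the next-best site while the first is down, and return once it recovers.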