Engineer for the future of Cloud
June 10-13, 2019
San Jose, CA

Use of self-healing techniques and failure injections to build a reliable service at Adobe

2:20pm-3:05pm Wednesday, June 12, 2019
Average rating: 2.75 (4 ratings)

Who is this presentation for?

  • SREs, DevOps, system engineers, and CTOs

Level

  Beginner

What you'll learn

  • Learn about a real-world use case for self-healing and chaos engineering in the design of resilient services

Description

      The advertising industry faces numerous challenges in achieving its goal of targeting a given audience dynamically and accurately to deliver a meaningful brand message. The need for near-real-time, low-latency delivery of dynamic content, the sheer volume of information processed, and the sparse geographic distribution of the intended eyeball traffic all add to the complexity of building a successful experience for the end user and the brand. Additionally, the competitiveness of the industry makes it critical to keep operational expenses low while delivering reliably at scale. In addressing these challenges, Nicolas Brousse and Oleksii Mykhailov found that a distributed infrastructure that leverages public cloud providers and a private cloud with open infrastructure technologies can deliver dynamic advertising content with low latency while remaining highly available. But network and physical utility infrastructures can’t be relied upon to ensure service dependability. Nicolas and Oleksii show that the complexity of the networks, the sparse geographic distribution of eyeballs, the risk of data center failures, and the increase in encrypted transactions call for thoughtful architectures. Introducing modern practices, failure injections, and self-healing mechanisms allowed them to improve the service’s fault tolerance while optimizing for latency and significantly improving reliability.

      Nicolas and Oleksii present the results covered in their industry paper “Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service,” which won the Best Disruptive Idea Award at the 29th IEEE International Symposium on Software Reliability Engineering. They also give a live demo in which a failure injection shuts down a full data center, showing the traffic shifting live and then recovering.

      Nicolas Brousse

      Adobe

      Nicolas Brousse manages and scales the Adobe Advertising Cloud infrastructure. Previously senior director of operations engineering at TubeMogul and the company’s sixth employee, Nicolas grew TubeMogul’s infrastructure over the past decade from several machines to a few thousand servers that handle hundreds of billions of requests per day for clients like Allstate, Chrysler, Heineken, and Hotels.com.

      Skilled at adapting quickly to ongoing business needs and constraints, Nicolas leads a global team of site reliability engineers and database architects that monitors the Adobe Advertising Cloud infrastructure 24/7 and adheres to DevOps methodology. Nicolas is a frequent speaker at top US technology conferences and regularly gives advice to other operations engineers. Before relocating to the US, Nicolas worked in technology for over 15 years, managing heavy traffic and large user databases for companies like MultiMania, Lycos, and Kewego.

      Oleksii Mykhailov

      Adobe

      Oleksii Mykhailov is a senior SRE at Adobe and has been a key contributor to Adobe Advertising Cloud, which handles over 350 billion requests a day. Oleksii built the foundation of that large-scale infrastructure during a period of hypergrowth while driving key reliability initiatives.