September 19–20, 2016: Training
September 20–22, 2016: Tutorials & Conference
New York, NY

Anomaly detection at scale: How Uber continuously monitors 500 local businesses

Akshay Shah (Uber), Michael Hamrah (Uber)
4:45pm–5:25pm Thursday, 09/22/2016
Measuring the right things; Automation, Continuous delivery
Beekman
Audience level: Beginner
Average rating: 4.80 (5 ratings)

Prerequisite knowledge

  • A general understanding of typical engineering monitoring and alerting tools (e.g., Graphite and Nagios)
  • A high-level understanding of RPC-oriented microservice architectures (useful but not required)
What you'll learn

  • Learn a clear set of criteria for selecting high-impact business metrics to monitor
  • Explore a roadmap to anticipate and overcome the difficulties inherent in monitoring business outcomes
  • Understand best practices for the development of a robust anomaly detector tailored to your business
Description

    Like many companies, Uber launched with a monolithic backend; driver dispatching, receipt processing, and every other business function ran as a component within one application. A few short years later, Uber runs a complex system of more than a thousand microservices. While applications are now simpler to modify and safer to deploy, they’re further removed from the business that they support—even if every service is healthy, Uber can’t be sure that riders in each city are able to take trips.

    If you’re planning (or in the midst of) a transition to microservices, you’ll need a strategy to deal with the same challenge: your system architecture no longer matches your business. How can you reassemble the metrics from your microservices to confidently monitor the messy world of business outcomes? How can you strike the right balance between catching outages and avoiding midnight pages?

    Akshay Shah and Michael Hamrah share the challenges Uber faced when monitoring business outcomes instead of engineering metrics and why building an anomaly detection system to solve those problems is easier than you might expect. Akshay and Michael describe how Uber selected which metrics to monitor and why traditional software monitoring tools don’t work for business metrics. They also offer an overview of Uber’s scalable, low-noise, highly accurate anomaly detection system, highlighting the design trade-offs made to prioritize simplicity and performance.

    Akshay Shah

    Uber

    Akshay Shah is a senior software engineer at Uber, where he works on anomaly detection and RPC frameworks. Previously, he was a full stack web application developer, a physician, a public school teacher, and an email spammer.

    Michael Hamrah

    Uber

    Michael Hamrah is a senior software engineer on Uber’s observability team, where he focuses on ingestion and management of high-volume metrics. Prior to Uber, he was a principal engineer at Getty Images, working on asset management in the cloud.