Sep 23–26, 2019

ThirdEye: LinkedIn’s business-wide monitoring platform

Akshay Rai (LinkedIn)
2:05pm–2:45pm Thursday, September 26, 2019
Location: 1E 14
Average rating: 4.50 (4 ratings)

Who is this presentation for?

  • Software engineers and business analysts

Level

Intermediate

Description

Mean time to detect (MTTD) and mean time to restore (MTTR) describe how long it takes to discover a problem and how long it takes to restore service once the problem has been detected. The shorter the MTTD and MTTR, the less time your product spends in outage and the more availability it retains. Given that products and services inevitably break at some point, you need to be adept at detecting problems and restoring service as quickly as possible. The issue triage and restoration lifecycle is made up of several steps: capturing metrics, detection (which requires monitoring and alerting), escalation, investigation, and remediation. Each step needs to be measured for efficiency and effectiveness in order to keep MTTD and MTTR as short as possible.
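As a rough illustration of the definitions above, the following sketch computes MTTD and MTTR from a list of incident timestamps. The incident records and function names are hypothetical and are not taken from the talk or from ThirdEye; MTTR is measured from detection to restoration, as described above.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, restored).
incidents = [
    (datetime(2019, 9, 1, 10, 0), datetime(2019, 9, 1, 10, 20), datetime(2019, 9, 1, 11, 0)),
    (datetime(2019, 9, 8, 14, 0), datetime(2019, 9, 8, 14, 10), datetime(2019, 9, 8, 14, 40)),
]


def mean_minutes(deltas):
    """Average a sequence of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in deltas)


def mttd_minutes(incidents):
    """Mean time from the start of a problem to its detection."""
    return mean_minutes(detected - started for started, detected, _ in incidents)


def mttr_minutes(incidents):
    """Mean time from detection to restored service."""
    return mean_minutes(restored - detected for _, detected, restored in incidents)


print(mttd_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 35.0
```

With these sample records, the average time spent in outage per incident is the sum of the two values, 50 minutes, which is why shortening either metric improves availability.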

Akshay Rai walks you through ThirdEye, a self-service experience enabling anyone to rapidly identify and investigate deviations in business and system metrics. At LinkedIn, ThirdEye is used by several teams of business analysts and engineers, and over 10,000 metrics are actively monitored. ThirdEye provides anomaly detection and collaborative dashboards for data analysis and brings together, in a single place, critical data that impacts metrics: holidays, deployments, company-wide issues, and more. You’ll leave with an understanding of the concepts behind the open source ThirdEye project and how it’s built, along with a look into ThirdEye’s insights and long-term plans. Akshay also gives you an analysis of how ThirdEye helped detect and investigate some of the major issues that occurred at LinkedIn.
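To make "deviations in a metric" concrete, here is a minimal, generic sketch that flags points far from a trailing-window median. This is not ThirdEye's detection algorithm, which the abstract does not describe; the function name, window size, threshold, and sample data are all assumptions chosen for illustration.

```python
from statistics import median


def find_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value deviates strongly from the trailing-window median."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = median(history)
        # Median absolute deviation: a robust estimate of typical spread.
        mad = median(abs(v - baseline) for v in history)
        if mad == 0:
            continue  # flat history: skip rather than divide by zero
        if abs(values[i] - baseline) / mad > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily metric with one sharp drop at index 8.
daily_signups = [100, 102, 98, 101, 99, 103, 100, 101, 60, 102]
print(find_anomalies(daily_signups))  # [8]
```

A median-based baseline is used here only because a single outlier in the history window does not distort it much; a production system would also need to account for seasonality and known events such as holidays and deployments.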

Prerequisite knowledge

  • General knowledge of monitoring and debugging issues

What you'll learn

  • Learn how to build and leverage a generic domain-independent platform to detect and recover from business and operational issues by running anomaly detection and diagnosis on a variety of metrics and data

Akshay Rai

LinkedIn

Akshay Rai is a senior software engineer at LinkedIn, where his primary focus is reducing the mean time to detect and the mean time to resolve issues that arise at LinkedIn. He works on LinkedIn’s next-generation anomaly detection and diagnosis platform. Previously, he led the popular Dr. Elephant project at LinkedIn and helped open source it, and he worked on operational intelligence solutions for Hadoop and Spark by building real-time systems that enable monitoring, visualizing, and debugging of big data applications and Hadoop clusters.

Comments

Akshay Rai | Senior Software Engineer
10/11/2019 7:49pm EDT

Hi, I have submitted the slides to the organizers. They should be published soon.

Meanwhile you can take a look here.
https://speakerdeck.com/akshayrai09/thirdeye-linkedins-business-wide-monitoring-platform

Anushka Jadhav | sr software engineer
10/09/2019 4:42pm EDT

Hi, can you please post the slides for this talk? Thanks!
