ThirdEye: LinkedIn’s Business-Wide Monitoring Platform
Who is this presentation for?
Software engineers, business analysts
Prerequisite knowledge
Basic knowledge of monitoring and debugging issues
What you'll learn
Mean time to detect (MTTD) and mean time to restore (MTTR) describe how long it takes to discover a problem and how long it takes to restore service once the problem has been detected. The shorter the MTTD and MTTR, the less time is spent in outage and the more availability your product retains. Because products and services will inevitably break at some point, we need to be adept at detecting problems and restoring service as quickly as possible. The issue triage and restoration lifecycle is made up of several steps: capturing metrics, detection (which requires monitoring and alerting), escalation, investigation, and remediation. Each step needs to be measured for efficiency and effectiveness in order to keep MTTD and MTTR as short as possible. We plan to achieve this using a platform called ThirdEye.
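As a rough illustration of the two metrics defined above, the sketch below computes MTTD and MTTR from a handful of incident records. The incident timestamps and the function names are hypothetical and are not taken from ThirdEye; MTTR is measured from the moment of detection, following the definition in this abstract.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, restored).
INCIDENTS = [
    (datetime(2018, 1, 1, 10, 0), datetime(2018, 1, 1, 10, 12), datetime(2018, 1, 1, 11, 0)),
    (datetime(2018, 1, 2, 14, 0), datetime(2018, 1, 2, 14, 30), datetime(2018, 1, 2, 15, 0)),
    (datetime(2018, 1, 3, 9, 0), datetime(2018, 1, 3, 9, 6), datetime(2018, 1, 3, 9, 30)),
]


def mttd_minutes(incidents):
    """Mean time from the start of a problem to its detection, in minutes."""
    return mean((detected - started).total_seconds() / 60
                for started, detected, _ in incidents)


def mttr_minutes(incidents):
    """Mean time from detection to restored service, in minutes."""
    return mean((restored - detected).total_seconds() / 60
                for _, detected, restored in incidents)


if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(INCIDENTS):.1f} min")
    print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")
```

Splitting each incident into a detection interval and a restoration interval is what makes it possible to see which part of the lifecycle dominates the total outage time.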
ThirdEye is a self-service experience that enables anyone to rapidly identify and investigate deviations in business and system metrics. At LinkedIn, ThirdEye is used by several teams of business analysts and engineers, and it actively monitors over 10,000 metrics. ThirdEye provides anomaly detection and collaborative dashboards for data analysis, and it brings together in a single place the critical data that impacts metrics: holidays, deployments, company-wide issues, and more. This talk introduces the concepts behind the open source ThirdEye project, explains how it is built, and shares what we have learned along with our long-term plans. The session also walks through an analysis of how ThirdEye helped detect and investigate some of the major issues that occurred at LinkedIn.
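To make the idea of flagging a metric deviation and pairing it with contextual events more concrete, here is a minimal sketch. It is not ThirdEye's algorithm or API: the trailing-mean baseline, the 30% threshold, and the names `find_deviations` and `annotate` are all assumptions chosen for illustration.

```python
from statistics import mean


def find_deviations(values, window=3, threshold=0.3):
    """Return indices whose value differs from the mean of the previous
    `window` points by more than `threshold` (as a fraction of that mean).

    This trailing-mean rule is an illustrative assumption only.
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = mean(values[i - window:i])
        if baseline == 0:
            continue  # relative deviation is undefined for a zero baseline
        if abs(values[i] - baseline) / abs(baseline) > threshold:
            flagged.append(i)
    return flagged


def annotate(flagged, events):
    """Attach known events (e.g. holidays, deployments) to flagged indices."""
    return {i: events.get(i, []) for i in flagged}


if __name__ == "__main__":
    # Hypothetical metric series and event calendar keyed by index.
    page_views = [100, 102, 98, 101, 60, 99, 100]
    events = {4: ["deployment"]}
    print(annotate(find_deviations(page_views), events))
```

Showing the events next to each flagged point mirrors the motivation stated above: an investigator can see at a glance whether a deviation coincides with a holiday, a deployment, or a wider issue.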
Akshay Rai is a Senior Software Engineer at LinkedIn whose primary focus is reducing the mean time to detect and the mean time to resolve issues that arise at LinkedIn. He is currently working on LinkedIn’s next-generation anomaly detection and diagnosis platform. Earlier, he led the popular Dr. Elephant project at LinkedIn and helped open source it. He has also worked on operational intelligence solutions for Hadoop and Spark, building real-time systems that enable monitoring, visualizing, and debugging of big data applications and Hadoop clusters.
View a complete list of Strata Data Conference contacts