Real-time anomaly detection on observability data using neural networks
Who is this presentation for?
- Technical folks, data scientists, and software engineers with intermediate knowledge of neural networks and streaming
Expedia is on a mission of connecting people to places through the power of technology. To accomplish this, the company builds and runs hundreds of microservices, each providing different functionality, that work together to serve a single customer request.
But there’s still the possibility of one or more services failing at the same time. You’ll take a look at how Expedia identifies these failed services in an automated manner and provides a high quality of service, which has led to huge improvements in the company’s mean time to know (MTTK) and mean time to resolve (MTTR).
Keshav Peswani and Ashish Aggarwal take you on the journey of distributed tracing at Expedia that started with Zipkin as a prototype and ended with the company building its own solution (in open source) using OpenTracing APIs. You’ll take a deep dive into the architecture and see how Expedia ingests terabytes of tracing data in production for hundreds of microservices, and how it uses this data for trending service errors, latencies, and rate. With the increasing number of microservices, there was a need for a real-time, intelligent alerting and monitoring system to contribute to the goal of reducing MTTK and MTTR and move toward 24-7 reliability.
With unique behavioral patterns for each of the service errors, leveraging neural networks to understand the behavior changes for each microservice and raise alerts was indeed a challenging task. The task uncovered a few unexpected challenges, and the solution was less straightforward than initially estimated. But ultimately the anomaly detector using a neural network produced results that beat the company’s expectations, once again validating the interest in neurocomputing that’s overtaking the industry.
To achieve this, Keshav and Ashish predict service failures in the microservices using recurrent neural networks on trends data and perform anomaly detection on the predicted values. They demonstrate how to train a recurrent neural network and autotune hyperparameters using Bayesian optimization methods. And they detail the architecture for the automated training pipeline using AWS SageMaker and Lambda, as well as how the anomaly detection works in a streaming manner using Kafka (KStreams) as the backbone and a model deployed on SageMaker in a cost-effective manner. Currently, Expedia plans to take human intervention into consideration to refine the alerts in order to reduce false positives and form a five-step methodology for anomaly detection.
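As a rough illustration of "anomaly detection on predicted values," the sketch below compares each incoming data point with a forecast and flags points whose forecast error falls far outside the recent errors. It is not Expedia's implementation: the class name `ResidualAnomalyDetector`, its parameters, the three-sigma style threshold, and the moving-average `forecast` function (a stand-in for a trained recurrent neural network) are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


class ResidualAnomalyDetector:
    """Hypothetical sketch: flag points whose forecast error is unusually large.

    `forecast` is any callable mapping a list of recent values to a predicted
    next value. In the talk's setting it would be a trained RNN; here it is
    only a placeholder supplied by the caller.
    """

    def __init__(self, forecast, history=3, window=20, k=3.0,
                 min_residuals=5, min_spread=1.0):
        self.forecast = forecast
        self.values = deque(maxlen=history)    # recent observations fed to the forecaster
        self.residuals = deque(maxlen=window)  # recent non-anomalous forecast errors
        self.k = k                             # assumed threshold multiplier
        self.min_residuals = max(min_residuals, 2)  # stdev needs at least 2 points
        self.min_spread = min_spread           # floor so a flat series is not over-sensitive

    def observe(self, actual):
        """Process one data point as it arrives; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) == self.values.maxlen:
            predicted = self.forecast(list(self.values))
            residual = actual - predicted
            if len(self.residuals) >= self.min_residuals:
                center = mean(self.residuals)
                spread = max(stdev(self.residuals), self.min_spread)
                is_anomaly = abs(residual - center) > self.k * spread
            if not is_anomaly:
                # Only normal errors update the baseline, so a burst of
                # failures does not widen the threshold.
                self.residuals.append(residual)
        self.values.append(actual)
        return is_anomaly


# Usage: per-minute error counts for one service, ending in a spike.
error_counts = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 12, 9, 10, 11, 10, 60]
detector = ResidualAnomalyDetector(forecast=mean)  # moving average stands in for the RNN
flags = [detector.observe(count) for count in error_counts]
```

In a streaming deployment, `observe` would be called once per incoming metric point (for example, from a Kafka consumer), with one detector instance per service metric. Note that this sketch still feeds anomalous values back into the forecaster's history, which a production system would likely handle more carefully.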
Prerequisite knowledge
- A working knowledge of neural networks, particularly recurrent neural networks
- A basic understanding of Kafka and AWS SageMaker
What you'll learn
- How to train neural networks and tune hyperparameters for hundreds of time series metrics in an automated fashion
- How to leverage KStreams along with neural networks to perform anomaly detection in real time
- How to build a simple and automated training pipeline using AWS SageMaker and AWS Lambda
- How to use telemetry data to improve developer productivity
Keshav Peswani is a senior software engineer at Expedia Group, focusing on technology and innovation on various platform initiatives. Keshav is involved in building neural network-based anomaly detection models as part of Expedia’s adaptive alerting system, an open source project for anomaly detection. He’s also a core contributor to Haystack, Expedia’s open source project for distributed tracing, which facilitates detection and remediation of problems in service-oriented architectures. Previously, he was at the D. E. Shaw group, and since then has worked on several projects based on deep learning, particularly recurrent neural networks, monolithic systems, distributed systems, and big data processing. Keshav is a fast learner and passionate about deep learning and event-driven architecture. He’s spoken about Haystack at Open Source India, Asia’s largest open source conference, and has talked about Haystack in Open Source For You (OSFY).
Ashish Aggarwal is a principal engineer at Expedia Group, leading Haystack, an open source project that’s rapidly being adopted for distributed tracing in fast-growing ecommerce companies like Expedia, HomeAway, Hotels.com, Egencia, and SoFi. He’s a full stack software and large-scale data systems engineer with experience in distributed web applications and data analytics platforms leveraging a multitude of languages and technologies. He’s a conference speaker at the Open Source Summit (Linux) and chair speaker at the OpenTracing meetup in Austin 2018.
View a complete list of Strata Data Conference contacts