Sep 23–26, 2019

Real-time anomaly detection on observability data using neural networks

Keshav Peswani (Expedia Group), Ashish Aggarwal (Expedia Group)
2:05pm–2:45pm Wednesday, September 25, 2019
Location: 1A 06/07
Secondary topics: Deep Learning, Temporal data and time-series analytics, Transportation and Logistics

Who is this presentation for?

Technical folks, data scientists, and software engineers with intermediate knowledge of neural networks and streaming

Level

Intermediate

Description

At Expedia, our mission is to connect people to places through the power of technology. To accomplish this, we build and run hundreds of microservices that work together to serve a single customer request. What happens when one or more of those services fail at the same time? We will look at how Expedia identifies failed services in an automated manner to maintain a high quality of service, which has led to large improvements in our mean time to know (MTTK) and mean time to resolve (MTTR).

In this talk, we present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open source solution using the OpenTracing APIs. We take a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data in production for hundreds of microservices and use that data to trend service errors, latencies, and rates. As the number of microservices grew, we saw the need for a real-time, intelligent alerting and monitoring system to help reduce MTTK and MTTR and move toward 24/7 reliability.
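As a rough illustration of what trending tracing data can mean, the sketch below aggregates spans into per-service, per-minute counts, error counts, and mean latency. It is not Haystack's implementation: the `Span` record, its field names, and the `trend_spans` function are hypothetical and simplified for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record; real tracing spans carry more fields.
    service: str
    start_ms: int       # span start time in epoch milliseconds
    duration_ms: float  # span duration in milliseconds
    error: bool         # whether the span was marked as failed


def trend_spans(spans, bucket_ms=60_000):
    """Aggregate spans into per-service, per-bucket count, errors, and mean latency."""
    acc = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for s in spans:
        # Align the span to the start of its time bucket (one minute by default).
        bucket = s.start_ms - s.start_ms % bucket_ms
        cell = acc[(s.service, bucket)]
        cell["count"] += 1
        cell["errors"] += int(s.error)
        cell["total_ms"] += s.duration_ms
    return {
        key: {
            "count": c["count"],
            "errors": c["errors"],
            "mean_latency_ms": c["total_ms"] / c["count"],
        }
        for key, c in acc.items()
    }
```

Each resulting series (for example, the error count of one service over time) is the kind of time-series metric an anomaly detector could then consume.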

Each service's errors follow their own behavioural patterns, so leveraging neural networks to learn the behaviour of each microservice and raise an alert when it changes was a challenging task. It uncovered a few unexpected challenges, and the solution was less straightforward than we initially estimated. Ultimately, though, the neural network based anomaly detector produced results that beat our expectations, once again validating the interest in neurocomputing that is overtaking the industry.

To achieve this, we predict service failures in the microservices using recurrent neural networks on trend data and perform anomaly detection on the predicted values. We show how we train a recurrent neural network and auto-tune its hyperparameters using Bayesian optimization methods. We also take a deep dive into the architecture of the automated training pipeline, built on AWS SageMaker and AWS Lambda, and into how anomaly detection works in a streaming manner, with Kafka (Kafka Streams) as the backbone and the model deployed on SageMaker in a cost-effective manner. At the time of writing, we plan to incorporate human intervention to refine the alerts, reduce false positives, and form a five-step methodology for anomaly detection.
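One common way to perform anomaly detection on predicted values is to compare each observation with its forecast and flag points whose residual is far outside the recent residual distribution. The sketch below assumes the forecasts are already available (in the talk's setting they would come from the recurrent neural network). The `detect_anomalies` function, the rolling window, and the k-sigma rule are illustrative assumptions, not the method presented in the talk.

```python
import statistics


def detect_anomalies(actual, predicted, window=30, k=3.0, min_sigma=1e-6):
    """Return indices whose residual deviates more than k sigma from recent residuals.

    `predicted` stands in for model forecasts; the thresholding rule here is an
    assumption made for illustration only.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    anomalies = []
    for i in range(window, len(residuals)):
        history = residuals[i - window:i]
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)
        # Floor sigma so floating-point noise on a perfectly flat history
        # is not flagged as an anomaly.
        threshold = k * max(sigma, min_sigma)
        if abs(residuals[i] - mu) > threshold:
            anomalies.append(i)
    return anomalies
```

For example, with forecasts fixed at 10.5 and observations alternating between 10 and 11, a single spike to 20 is the only point flagged. In a streaming setting, the same check could be applied to each incoming point against a buffer of recent residuals.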

Prerequisite knowledge

Familiarity with neural networks, particularly recurrent neural networks, along with basic knowledge of Kafka and AWS SageMaker and an interest in understanding how observability data fits in with neural networks.

What you'll learn

1. How to train neural networks and tune hyperparameters for hundreds of time-series metrics in an automated fashion.
2. How to leverage Kafka Streams along with neural networks to perform anomaly detection in real time.
3. How to build a simple, automated training pipeline using AWS SageMaker and AWS Lambda.
4. How to use telemetry data to improve developer productivity.

Keshav Peswani

Expedia Group

Keshav Peswani is a senior software engineer at Expedia Group, focusing on technology and innovation across various platform initiatives. He is involved in building a neural network based anomaly detection model as part of Expedia’s Adaptive Alerting system, an open source project for anomaly detection. He is also a core contributor to Haystack, Expedia’s open source project for distributed tracing, which facilitates the detection and remediation of problems in service-oriented architectures. Keshav started his career at D.E. Shaw & Co. and has since worked on projects spanning deep learning (particularly recurrent neural networks), monolithic systems, distributed systems, and big data processing. He is a fast learner and is passionate about deep learning and event-driven architecture.

Keshav has spoken about Haystack at Open Source India, Asia’s largest open source conference, and has discussed Haystack in OSFY.

Ashish Aggarwal

Expedia Group

Ashish is a principal engineer at Expedia Group, leading Haystack, an open source project that is rapidly being adopted for distributed tracing at fast-growing e-commerce companies such as Expedia, HomeAway, Hotels.com, Egencia, and Sofi.

He is a full-stack software and large-scale data systems engineer with experience in distributed web applications and data analytics platforms, leveraging a multitude of languages and technologies. He has been a conference speaker at Open Source Summit (Linux Foundation) and chair speaker at the OpenTracing meetup in Austin, 2018.
