Sep 23–26, 2019

Using Spark to speed up the diagnosis performance for big data applications

Ruixin Xu (Microsoft), Long Tian (Microsoft), Yu Zhou (Microsoft)
4:35pm–5:15pm Thursday, September 26, 2019
Location: 1E 09

Who is this presentation for?

Engineers, Product Managers, DevOps




Cosmos is Microsoft’s internal big data analysis platform. Every day, it processes huge volumes of data from Microsoft services such as Bing, Office, Windows, Xbox, and Dynamics. The DevOps team is responsible for maintaining the service reliability we have committed to customers. For each live-site issue, the on-call engineer has a hard deadline to mitigate the problem. For the past couple of years, we have been working on bringing an IDE-style diagnosis experience to large-scale applications. However, we observed several challenges for on-call engineers using our IDE diagnosis tools:
• Processing complex jobs with large profiles is slow, and the IDE may crash for jobs with profiles larger than 10 GB.
• We provide an auto-diagnosis wizard for common issues, but on-call engineers still need to dig deeper into various logging systems case by case.
• Documenting troubleshooting steps requires extra effort from on-call engineers.
To solve these challenges, we ran an experiment that replaced the diagnosis engine with Spark and used Jupyter Notebook as the frontend. The results indicate that the Spark-based solution improves diagnosis performance significantly, especially for complex jobs with large profiles. Jupyter Notebook also brings the benefits of fast iteration and easy knowledge sharing. In this session, we will share what we learned along the way.

Prerequisite knowledge

Distributed computing concepts, debugging, Spark, Jupyter Notebook

What you'll learn

How to use Spark to build troubleshooting tools for large-scale applications

Ruixin Xu


Ruixin Xu is a Senior Program Manager on the Microsoft Azure Big Data Tools team. Her focus areas are product design and project management, the development experience on big data platforms, software development toolchains, and Software as a Service (SaaS) offerings.


Long Tian


Long Tian is a Software Engineer Manager on the Microsoft Big Data Analytics team. Long focuses on building the developer experience (authoring, debugging, continuous integration, and monitoring) for cloud big data services, including Spark, Hive, and Azure Data Lake.


Yu Zhou


Yu Zhou is a Software Development Engineer on the Azure Big Data team at Microsoft. He earned his Master of Science degree in EE from Beijing University of Posts and Telecommunications and his Bachelor of Science degree in EE from Hunan University. He currently works on developing innovative big data solutions, including distributed computing systems and stream computing.
