Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Using ML to solve failure problems with ML and AI apps in Spark

Adrian Popescu (Unravel Data Systems), Shivnath Babu (Unravel Data Systems | Duke University)
5:25pm–6:05pm Wednesday, September 27, 2017
Data Engineering & Architecture, Spark & beyond
Location: 1A 21/22
Level: Advanced

Who is this presentation for?

  • DevOps engineers, developers, and data scientists

Prerequisite knowledge

  • Familiarity with Apache Spark

What you'll learn

  • Explore a new methodology to solve failure problems with ML and AI apps in Spark


Spark is widely used to run ML and AI apps thanks to its powerful declarative language, its efficiency, and its interoperability across a large set of input and output data formats. However, that agility has a roadblock: when an application fails, developers can get stuck and have a tough time finding and resolving the issue. Adrian Popescu and Shivnath Babu explain how to use a root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

In order to understand what went wrong with a Spark app, you have to look at many data sources, such as application logs, resource metrics, configuration settings, container utilization, and more. You know, just a typical big data problem. Instead of analyzing all these data sources one by one, Adrian and Shivnath gathered the logs of a large set of applications and created algorithms to detect and resolve common problems automatically. With these tools and techniques, they were able to resolve application failure problems in seconds instead of weeks.

This methodology uses a novel combination of machine learning techniques and domain knowledge about Spark internals encoded in a new symptoms database design. The machine learning component provides core techniques for problem diagnosis from telemetry data, and domain knowledge acts as checks and balances to guide the diagnosis in the right direction. This unique system design enables the diagnosis to function effectively even in the presence of multiple concurrent failures as well as noisy data prevalent in production environments.
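
The idea of domain knowledge acting as checks and balances on a machine learning diagnosis can be sketched roughly as follows. This is only an illustration under stated assumptions, not the speakers' system: the candidate scores stand in for the output of some trained model, and the names `refine`, `CHECKS`, and the telemetry fields are all hypothetical.

```python
# Hypothetical domain-knowledge checks: each candidate root cause maps to a
# predicate over telemetry that must hold for the diagnosis to be plausible.
CHECKS = {
    "executor out of memory":
        lambda t: t["peak_memory_mb"] >= 0.9 * t["memory_limit_mb"],
    "missing input":
        lambda t: t["input_bytes"] == 0,
}


def refine(scores, telemetry, checks=CHECKS, threshold=0.2):
    """Keep model-proposed root causes that the domain checks do not contradict.

    scores: dict of root cause -> confidence from some ML model (assumed given).
    Returns every surviving cause, most confident first, so that several
    concurrent failures can be reported together.
    """
    kept = {}
    for cause, score in scores.items():
        check = checks.get(cause)  # causes without a check pass through
        if score >= threshold and (check is None or check(telemetry)):
            kept[cause] = score
    return sorted(kept, key=kept.get, reverse=True)


# Example: the model suspects two causes, but telemetry rules one of them out.
scores = {"executor out of memory": 0.7, "missing input": 0.6, "data skew": 0.1}
telemetry = {"peak_memory_mb": 3900, "memory_limit_mb": 4096, "input_bytes": 1024}
plausible = refine(scores, telemetry)
```

Here the rule for "missing input" vetoes a fairly confident model guess because the app did read input data, which is the kind of correction the abstract attributes to domain knowledge.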

Adrian and Shivnath share a categorization of application failures based on symptoms and root causes (e.g., resource limitations, incorrect coding practices, invalid inputs/outputs, Spark implementation issues, and others), as well as representative signatures for these failures, before demonstrating how to use the root cause diagnosis algorithm and methodology to alleviate the failures.
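
A categorization by symptom and root cause could look something like the minimal sketch below. The signature table is an assumption made for illustration: the regular expressions match error strings commonly seen in Spark and JVM logs, but the mapping to categories and the names `SIGNATURES`, `categorize`, and `diagnose` are hypothetical, not the speakers' symptoms database.

```python
import re
from collections import Counter

# Hypothetical symptom signatures: (pattern over a log line, root-cause category).
SIGNATURES = [
    (re.compile(r"java\.lang\.OutOfMemoryError"), "resource limitation"),
    (re.compile(r"Container killed by YARN for exceeding memory limits"),
     "resource limitation"),
    (re.compile(r"FileNotFoundException|Path does not exist"),
     "invalid input/output"),
    (re.compile(r"NullPointerException|Task not serializable"),
     "incorrect coding practice"),
]


def categorize(log_lines):
    """Count how many log lines match each root-cause category."""
    counts = Counter()
    for line in log_lines:
        for pattern, category in SIGNATURES:
            if pattern.search(line):
                counts[category] += 1
                break  # first matching signature wins for this line
    return counts


def diagnose(log_lines):
    """Return matched categories, most frequently observed first."""
    return [category for category, _ in categorize(log_lines).most_common()]


# Example with made-up log lines that contain real-looking error strings.
logs = [
    "ERROR Executor: java.lang.OutOfMemoryError: Java heap space",
    "WARN YarnAllocator: Container killed by YARN for exceeding memory limits",
    "java.io.FileNotFoundException: File does not exist",
    "INFO DAGScheduler: Job 3 finished",
]
categories = diagnose(logs)
```

Pure pattern matching like this covers only the domain-knowledge half of the approach; the abstract describes combining it with machine learning over telemetry data.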

Adrian Popescu

Unravel Data Systems

Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. He holds a PhD in computer science from EPFL, where his thesis focused on modeling the runtime performance of a class of analytical workloads that includes iterative tasks executing on in-memory graph processing engines (Giraph BSP) and SQL queries executing at scale on Hive. He also holds a master of applied science from the University of Toronto and a bachelor of science from University Politehnica, Bucharest.

Shivnath Babu

Unravel Data Systems | Duke University

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.