In distributed systems, applications can fail for many reasons, such as running out of memory or timing out while waiting for a resource. The root cause may also lie deeper. For example, a timeout may occur because an application accesses tables containing small files or non-splittable files, which makes reading their data particularly slow. Whatever the reason, when an application fails, users must fix the cause of the failure to get the application running successfully. Since applications may interact with multiple components, a failed application can generate a large set of raw logs. These logs typically contain thousands of messages, including errors and stack traces. Hunting for the root cause of an application failure in these messy, raw, and distributed logs is hard for experts and a nightmare for the thousands of new users coming to the big data stack.
Alkis Simitsis and Shivnath Babu share an automated, deep learning-based technique for root cause analysis (RCA) of big data stack applications, focusing on Spark and Impala. They begin by describing how to automatically generate insights into a failed application in a multiengine big data stack. They then detail their approach to automatically identifying the root cause of an application failure, which consists of three parts: continuous collection of logs from Spark and Impala application failures, together with an automatic labeling mechanism based on unsupervised learning; conversion of logs into feature vectors using a three-layer neural network; and learning a predictive model for RCA from these feature vectors using deep learning and active learning techniques. They conclude by discussing algorithms for automatically fixing failed applications, which draw on examples of successful and failed runs of the same or similar applications from history. They then show how trying out a limited number of alternative configurations gets the application quickly to a running state, and walk you through getting the application to a resource-efficient running state.
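To make the shape of such a pipeline concrete, here is a minimal sketch of the "logs to feature vectors to predicted root cause" flow. It is not the speakers' method: it swaps in plain term-frequency vectors for the three-layer neural network and a nearest-centroid rule with cosine similarity for the deep learning model. The log messages, labels, and function names are all hypothetical, invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical labeled log snippets (invented for illustration, not real output).
TRAINING_LOGS = [
    ("java.lang.OutOfMemoryError: Java heap space at executor 12", "out_of_memory"),
    ("Container killed by YARN for exceeding memory limits", "out_of_memory"),
    ("TimeoutException: Futures timed out after 300 seconds", "timeout"),
    ("Query timed out waiting for scan of table with many small files", "timeout"),
]


def tokenize(message):
    """Lowercase the message and keep alphabetic tokens, dropping numbers and ids."""
    return re.findall(r"[a-z]+", message.lower())


def vectorize(message):
    """Turn a log message into a sparse term-frequency vector (stand-in for a learned embedding)."""
    return Counter(tokenize(message))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_logs):
    """Sum the term vectors of each label into one centroid per root cause."""
    centroids = {}
    for message, label in labeled_logs:
        centroids.setdefault(label, Counter()).update(vectorize(message))
    return centroids


def predict(centroids, message):
    """Return the root-cause label whose centroid is most similar to the message."""
    vector = vectorize(message)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


if __name__ == "__main__":
    model = train(TRAINING_LOGS)
    print(predict(model, "ExecutorLostFailure: OutOfMemoryError in heap"))
    print(predict(model, "Task timed out after 600 seconds"))
```

In a real system the labels would come from the automatic labeling step rather than being written by hand, and the vectorizer and classifier would be replaced by learned models; the sketch only illustrates how the stages connect.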
Alkis Simitsis is a chief scientist for cybersecurity analytics at Micro Focus. Alkis has more than 15 years of experience building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. He holds 26 US patents and has filed over 50 patent applications in the US and worldwide. He’s published more than 100 papers in refereed international journals and conferences (top publications cited 5,000+ times) and frequently serves in various roles on the program committees of top-tier international scientific conferences. He’s also an IEEE senior member and a member of the ACM.
Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
©2019, O'Reilly Media, Inc.