Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Debugging Apache Spark

Holden Karau (IBM), Joey Echeverria (Rocana)
1:50pm–2:30pm Thursday, March 16, 2017
Spark & beyond
Location: LL21 C/D
Level: Intermediate
Average rating: ***..
(3.67, 3 ratings)

Who is this presentation for?

  • Engineers and data scientists working with Apache Spark (although this talk is more engineering focused)

Prerequisite knowledge

  • A basic understanding of Spark on par with Learning Spark

What you'll learn

  • How to debug Apache Spark jobs

Description

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
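To see why lazy evaluation can make debugging harder, here is a minimal sketch in plain Python that uses generators as a stand-in for Spark's lazy transformations. It is an analogy only, not Spark code, and `parse_record` and `records` are hypothetical names chosen for illustration. The faulty function is attached to the pipeline on one line, but the exception only surfaces later, when the result is consumed.

```python
def parse_record(raw):
    """Hypothetical parser that fails on malformed input."""
    return int(raw)


# Hypothetical input with one malformed record.
records = ["1", "2", "oops", "4"]

# Defining the pipeline runs nothing, so the bad record goes unnoticed here.
parsed = (parse_record(r) for r in records)
doubled = (x * 2 for x in parsed)

# The failure only appears when something finally consumes the pipeline,
# which may be far from the line that introduced the bug.
results = []
error = None
try:
    for value in doubled:
        results.append(value)
except ValueError as exc:
    error = exc

print(results)
print(type(error).__name__)
```

The traceback in a case like this points at the consuming loop as well as the parser, which is roughly the situation a Spark developer faces when an error is reported at an action rather than at the transformation that caused it.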

Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before looking ahead to the data property type accumulators that may be coming to Spark in future versions. In addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to use the UI to quickly figure out whether certain types of issues are occurring in your job.
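The concern about accumulators and recomputation can be illustrated with a small simulation in plain Python. This is a sketch under stated assumptions, not Spark's API: `Counter` is a hypothetical stand-in for an accumulator, and calling `pipeline()` twice mimics an uncached dataset being recomputed for a second action.

```python
class Counter:
    """Hypothetical stand-in for a Spark accumulator (not the Spark API)."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


bad_records = Counter()


def is_valid(raw):
    # Side effect inside a transformation: count every rejected record.
    if not raw.isdigit():
        bad_records.add(1)
        return False
    return True


# Hypothetical input: two of the four records are malformed.
records = ["1", "x", "3", "y"]


def pipeline():
    # Rebuilt on every call, mimicking an uncached dataset being recomputed.
    return [int(r) for r in records if is_valid(r)]


total = sum(pipeline())   # first "action"
count = len(pipeline())   # second "action" reruns the filter

print(total, count, bad_records.value)
```

Only two records are malformed, yet the counter ends at four because the filter ran once per action. Counters updated inside transformations are therefore best read as approximate debugging signals unless the intermediate result is reliably cached.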

Holden Karau

IBM

Holden Karau is a software development engineer at IBM and is active in open source. Prior to IBM, she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. Holden is the author of Learning Spark and has assisted with Spark workshops. She graduated from the University of Waterloo with a bachelor of mathematics in computer science.

Joey Echeverria

Rocana

Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. He was previously a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly Media.

Comments

Shilpa Shukla | DATA ENGINEER
03/23/2017 12:13am PDT

Hi Holden and Joey, Could you please upload the slides for the talk? Thank you!

Alessandro Gagliardi
03/16/2017 7:19am PDT

Are the slides for this talk online somewhere? There are some links I’d like to follow!