Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Debugging Apache Spark

Holden Karau (IBM), Joey Echeverria (Rocana)
1:50pm–2:30pm Thursday, March 16, 2017
Spark & beyond
Location: LL21 C/D
Level: Intermediate
Average rating: ***..
(3.67, 3 ratings)

Who is this presentation for?

  • Engineers and data scientists working with Apache Spark (although this talk is more engineering focused)

Prerequisite knowledge

  • A basic understanding of Spark on par with Learning Spark

What you'll learn

  • How to debug Apache Spark jobs

Description

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
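
To illustrate why lazy evaluation can make debugging harder, here is a minimal sketch in plain Python. It is only an analogy and not Spark code: the built-in map() stands in for a lazy transformation and list() for an action, and the parse function and sample records are hypothetical.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.

def parse(record):
    # Hypothetical transformation that fails on malformed input.
    return int(record)

records = ["1", "2", "oops", "4"]

# Like a lazy transformation, map() only records what to do.
# Nothing is parsed yet, so no error is raised on this line.
parsed = map(parse, records)

# Like an action, list() forces the work. The failure surfaces here,
# away from the line that defined the faulty step.
try:
    result = list(parsed)
    failed_at_action = False
except ValueError as err:
    failed_at_action = True
    error_message = str(err)

print(failed_at_action, error_message)
```

The error is reported where the result is consumed rather than where the faulty step was declared, which is the kind of indirection the talk description refers to.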

Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. In addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to use the UI to quickly figure out whether certain types of issues are occurring in your job.
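
The concern about counters and recomputation can be sketched in plain Python. This is a simulation under stated assumptions, not Spark's accumulator API: the Counter class, the parse function, and the compute function are all hypothetical stand-ins, and calling compute() twice plays the role of an uncached dataset being recomputed.

```python
# Plain-Python simulation of a counter updated inside a recomputed step.
# This is not Spark's accumulator API; all names here are hypothetical.

class Counter:
    """Stand-in for an accumulator: a shared, add-only counter."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


bad_records = Counter()


def parse(record):
    # Count malformed records as a debugging aid instead of failing.
    try:
        return int(record)
    except ValueError:
        bad_records.add(1)
        return None


records = ["1", "oops", "3"]


def compute():
    # Recomputed from scratch on every call, like an uncached dataset.
    return [parse(r) for r in records]


first = compute()
count_after_first = bad_records.value

second = compute()  # simulated recompute of the same data
count_after_second = bad_records.value

print(count_after_first, count_after_second)
```

The data contains one malformed record, yet the counter doubles after the second pass. Counts gathered this way are therefore best read as hints for debugging rather than exact totals.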

Holden Karau

IBM

Holden Karau is a transgender Canadian, an Apache Spark committer, an active open source contributor, and co-author of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Joey Echeverria

Rocana

Joey Echeverria is the director of engineering at Rocana, where he builds applications for scaling IT operations on the Apache Hadoop platform. Joey is a committer on the Kite SDK, an Apache-licensed data API for the Hadoop ecosystem. He was previously a software engineer at Cloudera, where he contributed to several ASF projects, including Apache Flume, Apache Sqoop, Apache Hadoop, and Apache HBase. Joey is also a coauthor of Hadoop Security, published by O’Reilly Media.

Comments

Holden Karau | SOFTWARE DEVELOPMENT ENGINEER
03/27/2017 12:37pm PDT

Sorry for the slowness in getting the slides up – https://www.slideshare.net/hkarau/debugging-apache-spark-scala-python-super-happy-fun-times-2017 :)

Shilpa Shukla | DATA ENGINEER
03/23/2017 12:13am PDT

Hi Holden and Joey, Could you please upload the slides for the talk? Thank you!

Alessandro Gagliardi
03/16/2017 7:19am PDT

Are the slides for this talk online somewhere? There are some links I’d like to follow!