Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Debugging Apache Spark

Holden Karau (Google)
12:05–12:45 Thursday, 25 May 2017
Spark & beyond
Location: Capital Suite 12
Level: Intermediate
Average rating: 4.75 (4 ratings)

Who is this presentation for?

  • Data engineers

Prerequisite knowledge

  • A basic understanding of Spark on par with Learning Spark

What you'll learn

  • Learn how to debug and scale Apache Spark jobs

Description

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau explores how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
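
To make the lazy-evaluation point concrete, here is a toy sketch in plain Python. It is not Spark code: Python's built-in lazy `map` and a generator expression stand in for Spark transformations, and the hypothetical `collect` helper stands in for an action. The names `parse`, `records`, and `collect` are illustrative assumptions, not part of the talk.

```python
def parse(record):
    # Hypothetical parsing step; raises ValueError on malformed input.
    return int(record)


records = ["1", "2", "oops", "4"]

# "Transformations": nothing is executed on these two lines,
# so the malformed record does not raise an error yet.
parsed = map(parse, records)
doubled = (x * 2 for x in parsed)


def collect(pipeline):
    # "Action": the whole pipeline runs only here, so the failure
    # surfaces at this point rather than where parse was applied.
    try:
        return list(pipeline), None
    except ValueError as err:
        return None, err


result, error = collect(doubled)
```

The error is reported when the pipeline is consumed, not at the line that introduced the faulty step, which is the same reason a failing Spark job often points at the action rather than at the transformation that caused the problem.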

Spark’s own internal logging can often be quite verbose. Holden demonstrates how to effectively search logs from Apache Spark to spot common problems and discusses options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but Holden looks at how to effectively use Spark’s current accumulators for debugging before looking ahead to the data property type accumulators that may be coming to Spark in future versions. In addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden covers how to use the UI to quickly figure out whether those issues are occurring in your job.
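
The accumulator caveat mentioned above can be illustrated with a small toy model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: `ToyAccumulator` and `LazyDataset` are hypothetical classes invented here, and the dataset simply recomputes from its source on every action unless it has been cached.

```python
class ToyAccumulator:
    """Hypothetical stand-in for a Spark accumulator (add-only counter)."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


class LazyDataset:
    """Toy lazy dataset: recomputes from source on each action unless cached."""

    def __init__(self, source, fn):
        self._source = source
        self._fn = fn
        self._use_cache = False
        self._cache = None

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # Cache hit: the function (and its accumulator updates) is skipped.
        if self._cache is not None:
            return self._cache
        # Cache miss: every element is recomputed, so side effects repeat.
        result = [self._fn(x) for x in self._source]
        if self._use_cache:
            self._cache = result
        return result


def count_invalid(use_cache):
    invalid = ToyAccumulator()

    def check(x):
        # Side effect inside a "transformation": count negative records.
        if x < 0:
            invalid.add(1)
        return x

    dataset = LazyDataset([1, -2, 3, -4], check)
    if use_cache:
        dataset.cache()
    dataset.collect()  # first action
    dataset.collect()  # second action
    return invalid.value


uncached_count = count_invalid(use_cache=False)  # 4: both actions recompute
cached_count = count_invalid(use_cache=True)     # 2: second action hits the cache
```

Only two records are invalid, yet the uncached run reports four because the counting function ran once per action. In this toy model, counts taken inside a transformation are best read as debugging hints rather than exact totals.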


Holden Karau

Google

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.