Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Beyond shuffling: Tips and tricks for scaling Spark jobs

Holden Karau (Independent)
16:35–17:15 Thursday, 2/06/2016
Spark & beyond
Location: Capital Suite 13
Level: Intermediate
Average rating: 4.78 (9 ratings)

Prerequisite knowledge

Attendees should know the basics of Spark (RDDs, DataFrames) and have a general understanding of how Spark works.


Holden Karau walks attendees through common mistakes that can keep Spark programs from scaling, then examines solutions and general techniques for moving beyond a proof of concept to production.

Topics include:

  • Working with key/value data
  • Replacing groupByKey for awesomeness
  • Key skew: your data probably has it, and how to survive it
  • Effective caching and checkpointing
  • Considerations for noisy clusters
  • Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
  • How to make our code testable
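
As a rough illustration of the groupByKey topic above, the sketch below models in plain Python (not Spark) why aggregating within each partition before the shuffle can move fewer records than grouping first. The abstract does not say which replacement the talk recommends; a reduceByKey-style combine is assumed here as one commonly cited option, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def shuffle_group_then_sum(partitions):
    """groupByKey-style model: every (key, value) record crosses the shuffle."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def combine_then_shuffle(partitions):
    """reduceByKey-style model: sum inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition leaves the partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical sample data: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped_totals, grouped_records = shuffle_group_then_sum(partitions)
combined_totals, combined_records = combine_then_shuffle(partitions)
```

Both functions produce the same per-key totals; only the number of records crossing the modeled shuffle differs. The saving depends on how often keys repeat within a partition, so it is an illustration rather than a guarantee.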

Holden Karau


Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.