Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Everyday I’m Shuffling - Tips for Writing Better Spark Programs

Vida Ha (Databricks), Holden Karau (Independent)
4:00pm–4:40pm Friday, 02/20/2015
Spark in Action
Location: 230 C
Slides:   external link

This session covers a series of tips for writing better Spark programs, presented visually in slides with code snippets and diagrams to illustrate the points. Here is an example topic in polished form:

Databricks Spark Knowledgebase on Avoiding GroupByKey

Here is a tentative list of other tips:

  • Working around Bad Data
    When working with a large dataset, chances are there is bad data in the set. We present how to work around bad data.
  • Factoring Code for Reuse in Batch and Streaming modes
    Spark allows a user to write their business logic once and use it in both batch and streaming modes. We describe the right way to factor code for reuse.
  • Testing Spark Programs
    Tips for writing good tests for Spark programs.
  • Slow Spark Jobs: Even Sharding
    Spark actions are bottlenecked by the slowest executor task. Uneven sharding is one cause of slow tasks. We show how to detect uneven sharding and suggest better sharding functions.
  • ReduceByKey vs. GroupByKey
    ReduceByKey should be preferred over GroupByKey because ReduceByKey combines the values for each key within a partition before shuffling, which reduces the amount of data transferred over the network compared to GroupByKey.
  • Execution in the Driver vs. Executor
    Traditional MapReduce requires writing a controller main class, a map class, and a reduce class. Spark allows you to write one simple program for all those pieces, but that makes it less clear in the API where code is executed. We see users run into issues on the Spark user mailing list because they don’t understand Spark’s execution model. We’ll show the error you see, cover common mistakes beginners make, and present better ways to solve the problems.
  • Persisting Large Datasets to Databases
    When persisting a large RDD to a database, database connections should be opened on each partition rather than in the driver program. Opening them in the driver is another mistake we commonly see on the Spark user mailing list.
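
The ReduceByKey vs. GroupByKey tip above can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: the partitions, the sample data, and the helper names shuffle_group_by_key and shuffle_reduce_by_key are all hypothetical, and the "shuffle" is simply a list of records that would cross the network. It only models the idea of combining values per key inside each partition before the shuffle.

```python
from collections import defaultdict

# Hypothetical input: (word, count) pairs already split across two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("c", 1)],
]


def shuffle_group_by_key(parts):
    """Model GroupByKey: every record crosses the network unchanged."""
    shuffled = [record for part in parts for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    # The sum happens only after the shuffle.
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(parts):
    """Model ReduceByKey: combine per key inside each partition first."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # At most one record per key leaves each partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


group_totals, group_shuffled = shuffle_group_by_key(partitions)
reduce_totals, reduce_shuffled = shuffle_reduce_by_key(partitions)

print(group_totals == reduce_totals)    # both approaches give the same totals
print(group_shuffled, reduce_shuffled)  # records sent through the modeled shuffle
```

In PySpark itself, the two approaches would typically be written as rdd.groupByKey().mapValues(sum) and rdd.reduceByKey(lambda a, b: a + b). The sketch only models why the second form moves fewer records.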

Vida Ha


Vida is currently a Solutions Engineer at Databricks. Previously, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.


Holden Karau


Holden Karau is a Software Development Engineer at Databricks and is active in open source. She is the author of a book on Spark and has assisted with Spark workshops. Prior to Databricks, she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Comments on this page are now closed.


Vida Ha
02/23/2015 2:27am PST

Here are the slides:

Sergey Zelvenskiy
02/20/2015 11:42am PST

Please share slides.

Vu Ha
02/20/2015 6:24am PST

Will the slides be available? Thanks!