Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Testing and validating Spark programs

Holden Karau (Independent)
5:10pm–5:50pm Wednesday, 03/30/2016
Spark & Beyond

Location: 210 A/E
Average rating: 4.11 (19 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of Apache Spark in Scala, Java, or Python.


Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Expanding on her Strata NYC talk, “Effective Testing of Spark Programs,” Holden Karau details reasonable validation rules for production jobs, best practices for creating effective tests, and options for generating test data.

Holden explores best practices for generating complex test data, setting up performance testing, and writing basic unit tests. The validation component focuses on how to create reasonable validation rules given the constraints of Spark’s accumulators.

Unit testing of Spark programs is deceptively simple. Holden looks at how unit testing of Spark itself is accomplished and distills a number of best practices into traits we can use. These include handling local-mode cluster creation and teardown during test suites, factoring our functions to increase testability, and providing mock data for RDDs and for Spark SQL. A number of interesting problems also arise when testing Spark Streaming programs, including starting and stopping the streaming context, providing mock data, and collecting results, and Holden pulls out simple takeaways for dealing with these issues.
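
As a rough illustration of factoring functions for testability (this is not code from the talk), the sketch below keeps the per-record logic in plain Python functions so it can be checked without starting a cluster. The names `tokenize` and `count_words` are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, List


def tokenize(line: str) -> List[str]:
    """Split a line into lowercase words; runs of whitespace yield no empty tokens."""
    return line.lower().split()


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """Reference implementation used to check expected results on small inputs."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in tokenize(line):
            counts[word] = counts.get(word, 0) + 1
    return counts


# Assumption: in a real job the same pure function would be handed to Spark,
# for example rdd.flatMap(tokenize), so only the thin wiring layer needs a
# local-mode context to test.
if __name__ == "__main__":
    print(count_words(["Testing Spark", "testing   validation"]))
```

Because `tokenize` has no dependency on a Spark context, most of the logic can be covered by fast, ordinary unit tests, leaving a smaller set of slower tests for the code that actually touches RDDs.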

Holden also explores Spark’s internal methods for generating random data, as well as options using external libraries to generate effective test datasets (for both small- and large-scale testing). And while acceptance tests are not always thought of as part of testing, they share a number of similarities, so Holden discusses which counters Spark programs generate that we can use for creating acceptance tests, best practices for storing historic values, and some common counters we can easily use to track the success of our job, all while working within the constraints of Spark’s accumulators.
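
One way to picture a counter-based validation rule is sketched below, assuming the job's counters have already been collected into a plain dictionary and that values from earlier runs are stored alongside them. The function name `validate_counters`, the counter names, and the 50% tolerance are all hypothetical, not taken from the talk. A tolerance band rather than an exact match is used partly because accumulator updates made inside transformations can be applied more than once when tasks are re-executed.

```python
from statistics import mean
from typing import Dict, List


def validate_counters(current: Dict[str, int],
                      history: List[Dict[str, int]],
                      tolerance: float = 0.5) -> List[str]:
    """Return rule violations; an empty list means the run looks plausible.

    A counter fails when it differs from the mean of its historic values
    by more than `tolerance` (a fraction of that mean).
    """
    problems: List[str] = []
    for name, value in current.items():
        past = [run[name] for run in history if name in run]
        if not past:
            continue  # no baseline yet, so nothing to compare against
        baseline = mean(past)
        if abs(value - baseline) > tolerance * abs(baseline):
            problems.append(
                f"{name}: {value} deviates from historic mean {baseline:.1f}")
    return problems


if __name__ == "__main__":
    # Hypothetical counter values from two earlier runs.
    earlier_runs = [
        {"records_read": 1000, "records_invalid": 10},
        {"records_read": 1100, "records_invalid": 12},
    ]
    todays_run = {"records_read": 100, "records_invalid": 11}
    for problem in validate_counters(todays_run, earlier_runs):
        print(problem)
```

A check like this could run after the job finishes and before its output is published, so that a run which read far fewer records than usual is flagged instead of silently replacing good data.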



Holden Karau


Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Comments on this page are now closed.


Holden Karau
03/03/2016 5:30am PST

This will be different, but continuing in a similar vein, for example including more on the tools to generate data for tests.

03/03/2016 5:22am PST

Is this presentation different from 2016 Spark Summit East?