Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Effective testing of Spark programs and jobs

Holden Karau (Google)
4:35pm–5:15pm Wednesday, 09/30/2015
Spark & Beyond
Location: 1 E20 / 1 E21 Level: Intermediate
Average rating: 4.17 (18 ratings)
Slides: http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015

This session explores best practices for creating both unit and integration tests for Spark programs, as well as acceptance tests for the data our Spark jobs produce. We will look at the difficulties of testing streaming programs, options for setting up integration testing with Spark, and best practices for acceptance tests.

Unit testing of Spark programs is deceptively simple. The talk will look at how unit testing of Spark itself is accomplished, and factor a number of best practices out into reusable traits. These include handling local-mode cluster creation and teardown during test suites, factoring our functions to increase testability, and providing mock data for RDDs and for Spark SQL.
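As a rough sketch of this pattern (the trait, object, and test names here are illustrative, not the talk's own code; spark-testing-base, linked below, ships a more complete SharedSparkContext trait that plays the same role), a reusable trait can own the local-mode SparkContext lifecycle while the logic under test is factored into a plain function fed mock RDD data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.scalatest.{BeforeAndAfterAll, FunSuite, Suite}

// Illustrative trait: creates a local-mode SparkContext before the suite
// runs and tears it down afterwards, so individual tests can share it.
trait LocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    sc = null
    super.afterAll()
  }
}

// A function factored out of the job so it can be tested directly.
object WordStats {
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
}

class WordStatsSuite extends FunSuite with LocalSparkContext {
  test("countWords on mock RDD data") {
    // sc.parallelize turns in-memory test fixtures into a mock RDD.
    val input = sc.parallelize(Seq("hello world", "hello spark"))
    val result = WordStats.countWords(input).collectAsMap()
    assert(result("hello") === 2)
    assert(result("world") === 1)
  }
}

For Spark SQL, the same trait can be extended to create a SQLContext from sc and build mock DataFrames from small local collections in the same way.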

Testing Spark Streaming programs raises a number of interesting problems, including starting and stopping the StreamingContext, providing mock input data, and collecting results. As with unit testing of Spark programs, we will factor the useful common components of these tests out into a trait that people can use.
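To illustrate the kind of boilerplate such a trait hides, here is a hand-rolled sketch using only core Spark Streaming APIs (queueStream to feed mock batches, foreachRDD to collect results); the trait and test names are made up for this example, and the timing-based wait at the end is exactly the sort of fragility the streaming support in spark-testing-base is meant to take care of:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite, Suite}

// Illustrative trait: creates a fresh StreamingContext for each test and
// makes sure it is stopped afterwards, even if the test fails.
trait LocalStreamingContext extends BeforeAndAfterEach { self: Suite =>
  @transient var ssc: StreamingContext = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
    ssc = new StreamingContext(conf, Seconds(1))
  }

  override def afterEach(): Unit = {
    if (ssc != null) ssc.stop(stopSparkContext = true, stopGracefully = false)
    ssc = null
    super.afterEach()
  }
}

class TokenizeStreamingSuite extends FunSuite with LocalStreamingContext {
  test("tokenize splits lines into words") {
    // Mock input: each RDD added to the queue becomes one micro-batch.
    val batches = mutable.Queue[RDD[String]]()
    val results = mutable.ArrayBuffer[String]()

    val lines = ssc.queueStream(batches)
    lines.flatMap(_.split(" ")).foreachRDD { rdd =>
      // Runs on the driver, so we can collect output into a local buffer.
      results ++= rdd.collect()
    }

    ssc.start()
    batches += ssc.sparkContext.parallelize(Seq("hi holden", "bye"))
    // Wait a few batch intervals so the mock batch gets processed.
    ssc.awaitTerminationOrTimeout(3000)

    assert(results.toSet === Set("hi", "holden", "bye"))
  }
}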

While acceptance tests are not always considered part of testing, they share a number of similarities with the other tests covered here. We will look at which counters Spark programs generate that we can use for creating acceptance tests, best practices for storing historic values, and some common counters we can easily use to track the success of our jobs.
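As a sketch of the general shape of such a check (using user-defined accumulators as a stand-in for the job-level counters discussed in the talk; loadHistoricCounts and saveCounts are hypothetical helpers for persisting per-run counter values, roughly the job spark-validator, linked below, automates), the job records counts while it runs and then compares them against previous runs before accepting the output:

import org.apache.spark.{SparkConf, SparkContext}

object AcceptanceCheckExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("acceptance-check"))

    val validRecords = sc.accumulator(0L, "validRecords")
    val invalidRecords = sc.accumulator(0L, "invalidRecords")

    // Counters are updated as records flow through; they are populated
    // once the save action below actually runs the job.
    val parsed = sc.textFile("input/path").flatMap { line =>
      parse(line) match {
        case Some(record) => validRecords += 1L; Some(record)
        case None         => invalidRecords += 1L; None
      }
    }
    parsed.saveAsTextFile("output/path")

    // Acceptance rules: enough valid records relative to recent runs,
    // and an error rate that has not suddenly spiked.
    val history = loadHistoricCounts()  // hypothetical helper
    val minExpected = history.map(_.validRecords).min * 0.9
    val errorRate = invalidRecords.value.toDouble /
      math.max(1L, validRecords.value + invalidRecords.value)

    val ok = validRecords.value >= minExpected && errorRate < 0.05
    saveCounts(validRecords.value, invalidRecords.value, ok)  // hypothetical helper
    if (!ok) sys.exit(1)  // fail the run so downstream consumers skip bad output
  }

  // Stand-in parser for the sketch.
  def parse(line: String): Option[String] =
    if (line.nonEmpty) Some(line) else None

  // Hypothetical persistence of historic counter values (e.g. a small table or file).
  case class RunCounts(validRecords: Long, invalidRecords: Long, accepted: Boolean)
  def loadHistoricCounts(): Seq[RunCounts] = Seq(RunCounts(1000L, 10L, accepted = true))
  def saveCounts(valid: Long, invalid: Long, accepted: Boolean): Unit = ()
}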

Relevant Spark Packages & Code:
https://github.com/holdenk/spark-testing-base / http://spark-packages.org/package/holdenk/spark-testing-base
https://github.com/holdenk/spark-validator


Holden Karau

Google

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.


Comments

Holden Karau
09/30/2015 1:38pm EDT

I’ve uploaded the slides from my talk to http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015 :)