Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Expanding her Strata NYC talk, “Effective Testing of Spark Programs,” Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data.
Holden explores best practices for generating complex test data, setting up performance testing, as well as basic unit testing. The validation component will focus on how to create reasonable validation rules given the constraints of Spark’s accumulators.
Unit testing of Spark programs is deceptively simple. Holden looks at how unit testing of Spark itself is accomplished and distills a number of best practices into traits we can use. This includes dealing with local mode cluster creation and tear down during test suites, factoring our functions to increase testability, mock data for RDDs, and mock data for Spark SQL. A number of interesting problems also arise when testing Spark Streaming programs, including handling of starting and stopping the streaming context, providing mock data, and collecting results, and Holden pulls out simple takeaways for dealing with these issues.
Holden also explores Spark’s internal methods for generating random data, as well as options using external libraries to generate effective test datasets (for both small- and large-scale testing). And while acceptance tests are not always thought of as part of testing, they share a number of similarities, so Holden discusses which counters Spark programs generate that we can use for creating acceptance tests, best practices for storing historic values, and some common counters we can easily use to track the success of our job, all while working within the constraints of Spark’s accumulators.
Relevant Spark packages and code:
Holden Karau is a transgender Canadian software engineer working in the bay area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.
Comments on this page are now closed.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.