Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Understanding Spark Tuning with Auto Tuning (or magical spells to stop your pager going off at 2am*)

Holden Karau (Google), Rachel Warren (Independent), Anya Bida (Alpine Data)
11:15–11:55 Thursday, 24 May 2018

Who is this presentation for?

Data engineers and people who got tricked into running a Spark cluster

Prerequisite knowledge

Knowledge of Apache Spark equivalent to what is covered in Learning Spark. Knowledge of High Performance Spark will help.

What you'll learn

A better understanding of the different configuration parameters for Spark and how to set them.


Tuning Apache Spark is something of a dark art, although thankfully when it goes wrong all we tend to lose is several hours of our day and our employer's money. This talk will look at how we can go about auto-tuning selected workloads using a combination of live and historical data.

Much of the data required to effectively tune jobs is already collected inside Spark; we just need to understand it. This talk will look at some sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work.

This talk will also look at what kind of tuning can be done statically (i.e., without depending on historical information), as well as Spark's own built-in components for auto-tuning (currently, dynamically scaling cluster size) and how we can improve them.
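As a rough illustration of the built-in dynamic scaling mentioned above, the sketch below assembles the standard Spark properties that enable dynamic allocation and turns them into `spark-submit` arguments. The property names are Spark's own; the helper functions `dynamic_allocation_conf` and `to_submit_args`, and the executor bounds used, are hypothetical and not taken from the talk.

```python
from typing import Dict, List


def dynamic_allocation_conf(min_executors: int, max_executors: int) -> Dict[str, str]:
    """Hypothetical helper: build Spark properties that enable dynamic allocation."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        # Let Spark add and remove executors based on the pending workload.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been removed.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Hypothetical helper: flatten properties into spark-submit `--conf` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # The bounds here are illustrative only; sensible values depend on the cluster.
    print(" ".join(to_submit_args(dynamic_allocation_conf(2, 20))))
```

The bounds are exactly the kind of knob an auto-tuner might adjust; the values shown are placeholders rather than recommendations.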

Even if the idea of building an “auto-tuner” sounds as appealing as “using a rusty spoon to debug the JVM on a haunted supercomputer”, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
*Also, to be clear, we don’t promise to stop your pager going off at 2am; we just hope this helps.

Holden Karau


Holden is a trans Canadian open source developer advocate with a focus on Apache Beam, Spark, and related “big data” tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. Prior to joining Google as a Developer Advocate she worked at IBM, Alpine, Databricks, Google (yes this is her second time), Foursquare, and Amazon. She was tricked into the world of big data while trying to improve recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Rachel Warren


Rachel Warren is a programmer, data analyst, adventurer, and aspiring data scientist. After spending a semester helping teach algorithms and software engineering in Africa, Rachel has returned to the Bay Area, where she is looking for work as a data scientist or programmer. Previously, Rachel worked as an analyst for both Pandora and the Political Science department at Wesleyan. She is currently interested in pursuing a more technical, algorithmic approach to data science and is particularly passionate about dynamic learning algorithms (ML) and text analysis. Rachel holds a BA in computer science from Wesleyan University, where she completed two senior projects: an application that uses machine learning and text analysis for the Computer Science department and a critical essay exploring the implications of machine learning on the analytic philosophy of language for the Philosophy department.

Anya Bida

Alpine Data

Anya loves her position as Senior Member of Technical Staff (SRE) at Salesforce. She’s also a co-organizer of the SF Big Analytics meetup group, and is always looking for ways to make platforms more scalable / cost efficient / secure. Before Salesforce, Anya enjoyed contributing at Alpine Data where she focused on Spark Operations. The opinions expressed in this presentation do not reflect those of Anya’s employers, past or present.
