Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
4:40pm–5:20pm Thursday, March 28, 2019
Average rating: 4.60 (5 ratings)

Who is this presentation for?

  • Data engineers and people who got tricked into running a Spark cluster

Level

Intermediate

Prerequisite knowledge

  • A basic understanding of Apache Spark equivalent to Learning Spark
  • Familiarity with high-performance Spark (useful but not required)

What you'll learn

  • Understand the different configuration parameters for Spark and how to set them

Description

Tuning Apache Spark is somewhat of a dark art, although thankfully when it goes wrong, all we tend to lose is several hours of our day and our employer's money. Much of the data required to effectively tune jobs is already collected inside Spark; we just need to understand it.

Holden Karau and Rachel Warren explain how to go about auto-tuning selective workloads using a combination of live and historical data, including new settings proposed in Spark 2.4. Holden and Rachel explore sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They then demonstrate what kind of tuning can be done statically (e.g., without depending on historic information) and detail Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
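As a concrete illustration of the built-in dynamic scaling mentioned above (a sketch, not an example from the talk itself), Spark's dynamic allocation can be enabled with a handful of standard configuration properties; the min/max executor counts and the script name here are placeholder values:

```shell
# Enable Spark's built-in dynamic executor scaling.
# (spark.shuffle.service.enabled is required for dynamic allocation
# on YARN; the executor bounds below are illustrative only.)
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_job.py
```

With these settings, Spark grows and shrinks the executor count based on the backlog of pending tasks, which is exactly the kind of built-in auto-tuning the talk proposes to build on.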

Even if the idea of building an auto-tuner sounds as appealing as “using a rusty spoon to debug the JVM on a haunted super computer,” join Holden and Rachel to gain a better understanding of the knobs available to you to tune your Apache Spark jobs.

Also, to be clear, Holden and Rachel don’t promise to stop your pager going off at 2:00am—they just hope this helps.


Holden Karau

Independent

Holden Karau is a transgender Canadian software developer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.


Rachel Warren

Salesforce Einstein

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.