Tuning Apache Spark is somewhat of a dark art, although thankfully when it goes wrong, all we tend to lose is several hours of our day and our employer's money. Much of the data required to effectively tune jobs is already collected inside Spark; we just need to understand it.
Holden Karau and Rachel Warren explain how to go about auto-tuning select workloads using a combination of live and historical data, including new settings proposed in Spark 2.4. Holden and Rachel explore sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They then demonstrate what kind of tuning can be done statically (i.e., without depending on historical information) and detail Spark’s own built-in components for auto-tuning (currently, dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as “using a rusty spoon to debug the JVM on a haunted supercomputer,” join Holden and Rachel to gain a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden and Rachel don’t promise to stop your pager from going off at 2:00am—they just hope this helps.
Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.
Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com