Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Improving Spark downscaling; Or, Not throwing away all of our work

Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

14:55–15:35 Wednesday, 1 May 2019

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Average rating: 3.75 (4 ratings)

Who is this presentation for?

Data engineers migrating to serverless or preemptible instances

Level

Intermediate

Prerequisite knowledge

Knowledge of Spark (equivalent to Learning Spark)

What you'll learn

Understand the challenges of downscaling
Learn approaches to fix them in Spark, along with the trade-offs of each

Description

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed more frequently, you can end up in a seemingly loop-like scenario, where you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again. Even if you aren’t in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes.

This is less than ideal. However, Spark 3 brings with it the exciting opportunity to make updates. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes, covering everything from how to schedule jobs to the location of blocks (shuffle and otherwise) and their impact. They also share some depressing numbers about different approaches and the trade-offs involved (because why end on a happy note?).

tl;dr: Just because we don’t need the entire cluster anymore doesn’t mean we need to throw away all of our work
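
To make the trade-off in the description concrete, here is a minimal toy model in plain Python. It is not Spark code and not the presenters' method: the executor names, the block layout, and the functions `scale_down_drop` and `scale_down_migrate` are all hypothetical, chosen only to illustrate the difference between discarding a removed executor's blocks (which forces recomputation) and moving them to surviving executors first.

```python
# Toy model (an assumption for illustration, not Spark's implementation):
# each executor holds a set of cached partition ids.

def scale_down_drop(blocks, removed):
    """Remove executors and discard their blocks.

    Returns the surviving layout and the set of partitions that are now
    lost and would have to be recomputed.
    """
    lost = set()
    for executor in removed:
        lost |= blocks[executor]
    remaining = {ex: set(parts) for ex, parts in blocks.items()
                 if ex not in removed}
    return remaining, lost


def scale_down_migrate(blocks, removed):
    """Remove executors, but first copy their blocks to the survivors.

    Blocks are assigned round-robin to the surviving executors. Returns
    the surviving layout and the number of blocks that were moved.
    """
    remaining = {ex: set(parts) for ex, parts in blocks.items()
                 if ex not in removed}
    survivors = sorted(remaining)
    if not survivors:
        raise ValueError("cannot migrate blocks: no executors would remain")
    moved = 0
    for executor in sorted(removed):
        for part in sorted(blocks[executor]):
            target = survivors[moved % len(survivors)]
            remaining[target].add(part)
            moved += 1
    return remaining, moved


if __name__ == "__main__":
    # Hypothetical cluster: four executors, two cached partitions each.
    layout = {
        "exec-1": {0, 1},
        "exec-2": {2, 3},
        "exec-3": {4, 5},
        "exec-4": {6, 7},
    }
    to_remove = {"exec-3", "exec-4"}

    _, lost = scale_down_drop(layout, to_remove)
    print("partitions to recompute after dropping:", sorted(lost))

    after, moved = scale_down_migrate(layout, to_remove)
    print("blocks moved instead of recomputed:", moved)
    print("layout after migration:", {k: sorted(v) for k, v in after.items()})
```

In this sketch, halving the cluster by dropping executors loses half of the cached partitions, while migrating keeps all eight at the price of moving four blocks. The model ignores everything that makes the real problem hard, such as memory limits on the survivors, network cost, and shuffle files.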

Holden Karau

Independent

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Mikayla Konst

Google

Mikayla Konst is a software engineer on the Cloud Dataproc team at Google. She helped launch Dataproc’s high availability mode and the Workflow Templates API. She’s currently working on improvements to shuffle and autoscaling.

Ben Sidhom

Google

Ben Sidhom is a software engineer on the Dataproc team at Google, improving the experience of autoscaling with Spark.


©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com