Sep 23–26, 2019
Downscaling: The Achilles heel of autoscaling Spark clusters

Prakhar Jain (Microsoft), Sourabh Goyal (Qubole)
4:35pm–5:15pm Wednesday, September 25, 2019
Location: 1A 21/22

Who is this presentation for?

  • Data engineers

Adding nodes at runtime (upscaling) to an already running Spark on YARN cluster is fairly easy. Taking those nodes away (downscaling) when the workload later drops is much harder. To remove a node from a running cluster, you must ensure it's being used for neither compute nor storage. In production workloads, many nodes can't be reclaimed even though they're underutilized, because containers are fragmented across the cluster: each node runs only one or two containers or executors despite having the resources to run more. Long-running Spark executors make this even more difficult. Other nodes hold shuffle data on local disk that a Spark application running on the cluster will consume later; the resource manager will never decide to reclaim these nodes, because losing shuffle data could lead to costly recomputation of stages or tasks.
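A toy illustration of the fragmentation problem, assuming a hypothetical cluster of 4 nodes that can each host 4 executors (this is not Spark or YARN code, just a sketch of the placement math):

```python
# Hypothetical sketch: why fragmented container placement blocks downscaling.
# 8 executors on 4 nodes, each node with capacity for 4 executors.

def spread(executors, nodes):
    """Round-robin placement: containers end up spread across all nodes."""
    placement = {n: 0 for n in range(nodes)}
    for i in range(executors):
        placement[i % nodes] += 1
    return placement

def pack(executors, nodes, capacity):
    """Bin-packing placement: fill one node completely before the next."""
    placement = {n: 0 for n in range(nodes)}
    remaining = executors
    for n in range(nodes):
        take = min(capacity, remaining)
        placement[n] = take
        remaining -= take
    return placement

def reclaimable(placement):
    """A node can only be downscaled if it runs no containers at all."""
    return [n for n, count in placement.items() if count == 0]

print(reclaimable(spread(8, 4)))      # [] -> every node busy, none removable
print(reclaimable(pack(8, 4, 4)))     # [2, 3] -> two nodes fully free
```

With the same total load, packed placement leaves two nodes completely empty and therefore eligible for reclamation, while spread placement leaves none.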

Prakhar Jain and Sourabh Goyal explore how to improve downscaling in Spark on YARN clusters in the presence of such constraints. They cover changes to the container-allocation strategy in the YARN scheduler and to the Spark task scheduler that together achieve better packing of containers, so that containers are consolidated onto fewer nodes and some nodes run no compute at all. By being careful about how containers are assigned in the first place, you reduce the chance of an application's containers ending up spread over many different nodes. They also examine enhancements to the Spark driver and external shuffle service (ESS) that proactively delete shuffle data once it is known to have been consumed. This ensures that nodes aren't holding unnecessary shuffle data, freeing them from storage responsibilities and making them available for reclamation and faster downscaling.

Prerequisite knowledge

  • Familiarity with cloud as a concept

What you'll learn

  • Identify efficient downscaling techniques for elastic clusters on the cloud

Prakhar Jain


Prakhar Jain is a senior software engineer on the Spark team at Microsoft. Previously, he worked on cluster orchestration and the big data stack at Qubole. Prakhar holds a bachelor's degree in computer science and engineering from the Indian Institute of Technology Bombay, India.


Sourabh Goyal


Sourabh Goyal is a member of the technical staff at Qubole, where he works on the Hadoop team. Sourabh holds a bachelor's degree in computer engineering from the Netaji Subhas Institute of Technology, University of Delhi.

