Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Scalability-aware autoscaling of a Spark application

Anirudha Beria (Qubole), Rohit Karlupia (Qubole)
16:35–17:15 Wednesday, 1 May 2019
Average rating: 3.67 (3 ratings)

Who is this presentation for?

  • Data engineers

Level

Intermediate

Prerequisite knowledge

  • Familiarity with cloud concepts

What you'll learn

  • Learn how Qubole improves autoscaling by learning the scalability limits of Spark applications and considering them as part of autoscaling decisions

Description

Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.

During the runtime of a big data application, resource requirements may fluctuate as the application progresses. Autoscaling aims to achieve good latency for workloads while reducing resource costs at the same time. Apache Spark’s autoscaling policy is based only on the current load during an application’s run, and there’s only so much you can do with this limited information. The policy exponentially increases resources if the current set of tasks at hand isn’t completed within a timeout. This can lead to overscaling, or it can starve the load if timeouts are high. For downscaling, the policy holds on to resources until another timeout expires. This, too, has scope for improvement.
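The load-based behavior described above can be sketched as a toy model: executor requests grow exponentially per backlog round (the behavior governed in Spark by settings such as `spark.dynamicAllocation.schedulerBacklogTimeout` and `spark.dynamicAllocation.executorIdleTimeout`), and executors idle past a timeout are released. This is an illustrative sketch, not Spark's actual implementation:

```python
# Toy sketch of a Spark-style dynamic allocation policy (illustrative,
# not Spark's actual code).

def upscale_requests(rounds_backlogged: int) -> int:
    """Executors requested in the given backlog round (1-indexed).
    Requests grow exponentially: 1, 2, 4, 8, ... per sustained round."""
    return 2 ** (rounds_backlogged - 1)

def downscale(executors, idle_seconds, idle_timeout=60):
    """Release executors that have been idle past the idle timeout;
    keep the rest."""
    return [e for e, idle in zip(executors, idle_seconds)
            if idle < idle_timeout]
```

The exponential ramp-up is what can overscale a short burst of tasks, while a long backlog timeout is what can starve a sudden spike.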

What if you included a new factor in your policy: repetition of workloads? In industry use cases, many workloads are repetitive in nature (e.g., ETL jobs, which may run daily or weekly). These repetitive workloads are often, if not always, resource heavy and account for the majority of an organization’s resource costs. Now suppose the autoscaling algorithm is fed historical information about a workload (collected with Sparklens, Qubole’s tool for Spark tuning and recommendations). This information includes the structure of jobs (temporal placement and relationships among stages) and latency constraints (skew and data partitioning). It can be leveraged to formulate an autoscaling policy for a future run that maximizes effectiveness with respect to latency and cost. You can draw parallels between this process and configuration tuning for jobs; this is policy tuning for autoscaling. The idea then extends to multiple workloads in a pipeline or scheduler, which is the reality in organizations dealing with big data.
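One way to picture turning a previous run's stage timeline into a plan for the next run is sketched below. The `executor_plan` helper and its input shape are hypothetical, chosen for illustration; they are not Sparklens' API:

```python
# Hypothetical sketch: derive a per-interval executor target from a
# previous run's stage timeline (the kind of structure a profiler such
# as Sparklens can surface). Field names are illustrative.

def executor_plan(stages):
    """stages: list of (start_s, end_s, executors_needed) tuples.
    Returns (time, target_executors) change points, where the target at
    each point is the peak need across all stages active at that time."""
    times = sorted({t for start, end, _ in stages for t in (start, end)})
    plan = []
    for t in times[:-1]:
        need = max((n for start, end, n in stages if start <= t < end),
                   default=0)
        plan.append((t, need))
    return plan
```

For two overlapping stages, e.g. `[(0, 10, 4), (5, 15, 2)]`, the plan holds 4 executors while the heavier stage runs and drops to 2 once it finishes, instead of waiting for an idle timeout.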

Anirudha Beria and Rohit Karlupia explain how Qubole simulates different autoscaling policies based on this idea, how it compares the policies visually and by latency and cost numbers, and how it applies the most effective policy.
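A drastically simplified version of such a simulation can be written in a few lines: feed a synthetic task-arrival trace to competing policies and score each by latency (steps to drain the work) and cost (executor-steps). This toy is only meant to show the shape of the comparison, not Qubole's simulator:

```python
# Toy simulator scoring autoscaling policies by latency and cost on a
# synthetic task trace (illustrative only).

def simulate(arrivals, policy):
    """arrivals: tasks arriving per time step. policy(pending, current)
    returns the executor count for the step; each executor finishes one
    task per step. Returns (makespan_steps, executor_steps_cost)."""
    pending, t, cost = 0, 0, 0
    executors = 0
    while t < len(arrivals) or pending > 0:
        pending += arrivals[t] if t < len(arrivals) else 0
        executors = policy(pending, executors)
        pending -= min(pending, executors)
        cost += executors
        t += 1
    return t, cost

greedy = lambda pending, cur: pending   # scale instantly to demand
fixed4 = lambda pending, cur: 4         # static cluster of 4 executors
```

On a burst of 8 tasks, the greedy policy finishes in 1 step for 8 executor-steps, while the fixed cluster takes 2 steps for the same 8 executor-steps; real traces with idle gaps and ramp-up delays separate the policies much more sharply.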


Anirudha Beria

Qubole

Anirudha Beria is a member of the technical staff at Qubole, where he’s working on query optimizations and resource utilization in Apache Spark.


Rohit Karlupia

Qubole

Rohit Karlupia is a technical director at Qubole, where his primary focus is making big data as a service debuggable, scalable, and performant. His current work includes SparkLens (an open source Spark profiler), GC/CPU-aware task scheduling for Spark, and the Qubole Chunked Hadoop File System. Rohit’s primary research interests are the performance and scalability of cloud applications. Over his career, he’s mainly written high-performance server applications and has deep expertise in messaging, API gateways, and mobile applications. He holds a bachelor of technology in computer science and engineering from IIT Delhi.