Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.
During the runtime of a big data application, resource requirements fluctuate as the application progresses. Autoscaling aims to achieve good latency for workloads while reducing resource costs at the same time. Apache Spark’s autoscaling policy is based only on the current load during an application’s run, and there’s only so much you can do with that limited information. The policy exponentially increases resources if the current set of tasks at hand isn’t completed within a timeout; this can overscale the cluster, or, if the timeouts are high, starve the load. For downscaling, the policy holds on to resources until another timeout expires. This, too, has room for improvement.
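Spark exposes these upscaling and downscaling timeouts through its dynamic allocation settings. A minimal sketch of the relevant configuration (the property names are Spark's own; the values are illustrative):

```properties
# spark-defaults.conf (values are illustrative)
spark.dynamicAllocation.enabled                           true
spark.shuffle.service.enabled                             true
spark.dynamicAllocation.minExecutors                      2
spark.dynamicAllocation.maxExecutors                      50
# Upscale: request executors once tasks have been backlogged this long...
spark.dynamicAllocation.schedulerBacklogTimeout           1s
# ...then keep requesting (exponentially more) at this interval while the backlog persists.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout  1s
# Downscale: release an executor only after it has been idle this long.
spark.dynamicAllocation.executorIdleTimeout               60s
```

Tuning these knobs trades latency against cost, but the policy itself still sees only the current backlog, not the shape of the workload.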
What if you included a new factor in your policy: repetition of workloads? In industry use cases, many workloads are repetitive in nature (e.g., ETLs that run daily or weekly). These repetitive workloads are usually resource heavy and account for the majority of an organization’s resource costs. Now consider feeding the autoscaling algorithm historical information about such a workload (gathered with Sparklens, Qubole’s tool for Spark tuning and recommendations): the structure of its jobs (the temporal placement of stages and the relationships among them) and its latency constraints (skew and data partitioning). This information can be leveraged to formulate an autoscaling policy for a future run that maximizes effectiveness with respect to latency and costs. You can draw parallels between this process and configuration tuning for jobs; this is policy tuning for autoscaling. The idea then extends to multiple workloads in a pipeline or scheduler, which is the reality in organizations dealing with big data.
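The payoff of a history-aware policy can be sketched with a toy simulator. This is not Qubole's simulator; it is a minimal illustration, assuming a workload recorded as sequential stages with known task counts and per-task durations, where a policy chooses an executor count per stage:

```python
import math

# A workload is a list of stages; each stage has a task count and a
# per-task duration (seconds). Stages run one after another, as in a
# simple Spark job. All names here are illustrative.
WORKLOAD = [(400, 2.0), (50, 8.0), (400, 2.0)]  # e.g. wide / narrow / wide

def simulate(workload, policy):
    """Run the workload under a policy: (stage index, task count) -> executors.
    Returns (latency in seconds, cost in executor-seconds)."""
    latency = cost = 0.0
    for i, (tasks, dur) in enumerate(workload):
        execs = policy(i, tasks)
        waves = math.ceil(tasks / execs)   # task "waves" per stage
        stage_time = waves * dur
        latency += stage_time
        cost += execs * stage_time         # pay for every executor-second
    return latency, cost

# Naive policy: a fixed fleet sized for the widest stage.
fixed = lambda i, tasks: 100

# History-aware policy: size each stage from last run's task counts,
# so narrow stages don't hold 100 idle executors.
history_aware = lambda i, tasks: min(100, tasks)

for name, p in [("fixed", fixed), ("history-aware", history_aware)]:
    lat, cost = simulate(WORKLOAD, p)
    print(f"{name:14s} latency={lat:6.0f}s cost={cost:8.0f} executor-s")
```

In this toy model the history-aware policy matches the fixed fleet's latency while dropping executor-seconds on the narrow middle stage, which is the kind of latency/cost comparison a policy simulator makes visible.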
Anirudha Beria and Rohit Karlupia explain how Qubole simulates different autoscaling policies based on this idea, compares the policies visually and by latency and cost numbers, and applies the most effective one.
Anirudha Beria is a member of the technical staff at Qubole, where he’s working on query optimizations and resource utilization in Apache Spark.
Rohit Karlupia is a technical director at Qubole, where his primary focus is making big data as a service debuggable, scalable, and performant. His current work includes Sparklens (an open source Spark profiler), GC/CPU-aware task scheduling for Spark, and the Qubole Chunked Hadoop File System. Rohit’s primary research interests are the performance and scalability of cloud applications. Over his career, he’s mainly written high-performance server applications and has deep expertise in messaging, API gateways, and mobile applications. He holds a bachelor of technology in computer science and engineering from IIT Delhi.
©2019, O’Reilly UK Ltd