Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Autoscaling Spark on Kubernetes

Holden Karau (Independent), Kris Nova (Independent)
14:05–14:45 Thursday, 2 May 2019
Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)
Average rating: 4.86 (7 ratings)

Who is this presentation for?

  • Data engineers and operations folks wanting to more effectively use cloud resources for big data

Level

Intermediate

Prerequisite knowledge

  • Basic knowledge of Apache Spark or a similar system
  • Familiarity with Kubernetes

What you'll learn

  • Understand how Spark autoscaling works today, the things being done to make it work more effectively on Kubernetes, and why getting your cloud bill under control with non-cloud-aware resource managers has been so frustrating

Description

The problem

Right now, operating an Apache Spark cluster is a manual process that demands close attention to detail and frequent human intervention. Furthermore, existing resource managers often aren’t integrated with the underlying cloud technologies. Engineers and operators already have the tooling they need to build advanced autoscaling systems for complex tools like Spark; we just need to integrate the existing components. With a bit of work, autoscaling can change the underlying resource usage and the infrastructure itself.

The solution

Kris Nova and Holden Karau explain how to create a hybrid Spark installation using Kubernetes primitives such as custom resource definitions (CRDs), wired up to declarative infrastructure through the Kubernetes Cluster API. This will be the first concrete example of taking advantage of many of the features of the Cluster API, applied to real-world enterprise customers using Apache Spark.

The implementation

Kris and Holden demonstrate how to deploy a hybrid Kubernetes cluster directly to Google Compute Engine using the Kubernetes Cluster API. This provides the declarative infrastructure components needed to scale the Spark cluster. For example, scaling worker nodes from 1 to 100:

kubectl scale machines spark-worker-node --replicas=1

kubectl scale machines spark-worker-node --replicas=100
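The scale commands above could be generated programmatically rather than typed by hand. The following is a minimal sketch, not the speakers' tooling: the helper name `scale_command` and the bounds of 1 and 100 are assumptions taken from the example, and it only builds the argument list without running kubectl.

```python
def scale_command(replicas, name="spark-worker-node",
                  min_replicas=1, max_replicas=100):
    """Build a `kubectl scale machines` invocation as an argument list.

    Hypothetical helper: the resource name and the bounds mirror the
    example in the text and are not taken from any real deployment.
    """
    if not min_replicas <= replicas <= max_replicas:
        raise ValueError(
            f"replicas must be between {min_replicas} and {max_replicas}"
        )
    return ["kubectl", "scale", "machines", name, f"--replicas={replicas}"]


if __name__ == "__main__":
    # Print the two commands from the text without executing them.
    for n in (1, 100):
        print(" ".join(scale_command(n)))
```

Returning a list instead of a single string means the result could be handed to `subprocess.run` without shell quoting concerns, should one choose to execute it against a real cluster.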

Kris and Holden deploy a layer of software (yet to be named) that consists of a handful of operators that provision the cluster. These operators will handle installing various components on Kubernetes and delivering the minimal set of information a software engineer needs to begin processing workloads with Apache Spark, while also handling scale-up and scale-down events. Importantly, this can change the underlying Kubernetes cluster's resource usage and infrastructure, not simply the percentage of a statically sized cluster that each job uses when sharing it.
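To make the scale-up and scale-down handling concrete, here is a minimal sketch of the kind of decision such an operator might make on each reconcile pass. It is an illustration under stated assumptions, not the software described in the talk: `ClusterState`, `desired_workers`, and the default of four executors per worker are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot an operator might observe."""
    current_workers: int    # worker machines currently provisioned
    pending_executors: int  # Spark executors requested but not yet schedulable
    idle_workers: int       # workers running no executors


def desired_workers(state, executors_per_worker=4,
                    min_workers=1, max_workers=100):
    """Return the worker count to request, clamped to [min, max].

    Scale up when executors are waiting for capacity; otherwise
    release idle workers. All parameters are illustrative assumptions.
    """
    if state.pending_executors > 0:
        extra = math.ceil(state.pending_executors / executors_per_worker)
        target = state.current_workers + extra
    else:
        target = state.current_workers - state.idle_workers
    return max(min_workers, min(max_workers, target))


if __name__ == "__main__":
    busy = ClusterState(current_workers=10, pending_executors=9, idle_workers=0)
    quiet = ClusterState(current_workers=10, pending_executors=0, idle_workers=4)
    print(desired_workers(busy), desired_workers(quiet))
```

The point of the sketch is that the result changes the number of machines in the cluster rather than reshuffling shares of a fixed pool. A real operator would also need to account for machine startup time and avoid oscillating between sizes.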

Holden Karau

Independent

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Kris Nova

Independent

Kris Nova is a senior developer advocate at Heptio focusing on containers, infrastructure, and Kubernetes. She is also an ambassador for the Cloud Native Computing Foundation. Previously, Kris was a developer advocate and an engineer on Kubernetes in Azure at Microsoft. She has a deep technical background in the Go programming language and has authored many successful tools in Go. Kris is a Kubernetes maintainer and the creator of kubicorn, a successful Kubernetes infrastructure management tool. She organizes a special interest group in Kubernetes and is a leader in the community. Kris understands the grievances with running cloud native infrastructure via a distributed cloud native application and recently authored an O’Reilly book on the topic: Cloud Native Infrastructure. Kris lives in Seattle, WA, and spends her free time mountaineering.