The problem
Operating an Apache Spark cluster today is a largely manual process that demands close attention to detail and frequent human intervention. Furthermore, existing resource managers often aren’t integrated with the underlying cloud technologies. Engineers and operators already have the tooling to build advanced autoscaling systems for complex tools like Spark; the existing components just need to be integrated. With a bit of work, autoscaling can change the underlying resource usage and infrastructure itself.
The solution
Kris Nova and Holden Karau explain how to create a hybrid Spark installation using Kubernetes primitives such as custom resource definitions (CRDs), wired up to declarative infrastructure through the Kubernetes Cluster API. This will be the first concrete example to take advantage of many of the Cluster API’s features and to apply them to real-world enterprise customers using Apache Spark.
The implementation
Kris and Holden demonstrate how to deploy a hybrid Kubernetes cluster directly to Google Compute Engine using the Kubernetes Cluster API. This provides the declarative infrastructure components needed to scale the Spark cluster. For example, scaling worker nodes from 1 to 100:
kubectl scale machines spark-worker-node --replicas=1
kubectl scale machines spark-worker-node --replicas=100
Kris and Holden deploy a layer of software (yet to be named) consisting of a handful of operators that provision the cluster. These operators install the required components on Kubernetes and deliver the minimal set of information a software engineer needs to begin processing workloads with Apache Spark, while handling scale-up and scale-down events. Importantly, this changes the resource usage and infrastructure of the underlying Kubernetes cluster itself, rather than just the share of a statically sized cluster that each job uses.
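To make the scale-up and scale-down idea concrete, here is a minimal sketch of the kind of decision such an operator might make. It is not the speakers' implementation: the `ScalePolicy` class, the `desired_workers` and `scale_command` helpers, and the cores-per-worker figure are all hypothetical, and only the machine name, the 1 to 100 range, and the kubectl command shape come from the abstract. The sketch only builds the command; it does not run it.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ScalePolicy:
    # Hypothetical sizing values; the 1..100 bounds mirror the abstract's example.
    cores_per_worker: int = 4
    min_workers: int = 1
    max_workers: int = 100


def desired_workers(pending_cores: int, running_cores: int, policy: ScalePolicy) -> int:
    """Return the worker count that covers running plus pending cores, clamped to the policy bounds."""
    needed = math.ceil((pending_cores + running_cores) / policy.cores_per_worker)
    return max(policy.min_workers, min(policy.max_workers, needed))


def scale_command(replicas: int, machine_name: str = "spark-worker-node") -> List[str]:
    """Build (but do not execute) the kubectl scale invocation shown in the abstract."""
    return ["kubectl", "scale", "machines", machine_name, f"--replicas={replicas}"]


if __name__ == "__main__":
    policy = ScalePolicy()
    # Assumed demand figures, purely for illustration.
    replicas = desired_workers(pending_cores=120, running_cores=16, policy=policy)
    print(" ".join(scale_command(replicas)))
```

A real operator would more likely update the replica count through the Kubernetes API than shell out to kubectl; the command form is kept here only because it matches the example above.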
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.
Kris Nova is a senior developer advocate at Heptio focusing on containers, infrastructure, and Kubernetes. She is also an ambassador for the Cloud Native Computing Foundation. Previously, Kris was a developer advocate and an engineer on Kubernetes in Azure at Microsoft. She has a deep technical background in the Go programming language and has authored many successful tools in Go. Kris is a Kubernetes maintainer and the creator of kubicorn, a successful Kubernetes infrastructure management tool. She organizes a special interest group in Kubernetes and is a leader in the community. Kris understands the grievances with running cloud native infrastructure via a distributed cloud native application and recently authored an O’Reilly book on the topic: Cloud Native Infrastructure. Kris lives in Seattle, WA, and spends her free time mountaineering.
©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com