There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Kimoon and Ilan demonstrate how the Spark scheduler can still provide HDFS data locality on Kubernetes when HDFS also runs on Kubernetes, and how they made Spark correctly discover the mapping from Kubernetes containers to physical nodes to HDFS datanode daemons. You’ll also discover how Spark on Kubernetes interacts with secure HDFS using Kubernetes constructs such as secrets and RBAC. The secure HDFS solution also applies when Spark on Kubernetes accesses an HDFS cluster that runs outside the Kubernetes cluster.
Kimoon Kim is a software engineer at Pepperdata. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive datasets.
Ilan Filonenko is a four-time returning engineering intern at Bloomberg LP, where he has designed and architected distributed systems at both the application and infrastructure levels. Previously, Ilan was an engineering consultant and technical lead at various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan’s current research studies algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms such as stochastic gradient descent (SGD).
©2018, O'Reilly Media, Inc.