Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

HDFS on Kubernetes: Tech deep dive on locality and security

Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
4:20pm–5:00pm Thursday, March 8, 2018

Who is this presentation for?

  • Data scientists, big data engineers, software developers, and big data architects

Prerequisite knowledge

  • A basic understanding of Spark and big data platforms and architecture

What you'll learn

  • How to run Spark on Kubernetes while accessing HDFS data with data locality and security preserved

Description

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing two challenges: HDFS data locality and secure HDFS support. Kimoon and Ilan demonstrate how the Spark scheduler can still provide HDFS data locality on Kubernetes if HDFS is also running on Kubernetes, and how they made Spark correctly discover the mapping from Kubernetes containers to physical nodes and, in turn, to HDFS datanode daemons. You’ll also discover how Spark on Kubernetes interacts with secure HDFS using Kubernetes constructs such as secrets and RBAC. The secure HDFS solution can also be used when Spark on Kubernetes accesses an HDFS deployment that runs outside the Kubernetes cluster.
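
To illustrate the locality idea in the description, here is a minimal sketch in Python. It is not the speakers' implementation and does not use any Spark or Kubernetes API. It assumes the mapping from pods to physical nodes is already available as a plain dictionary, and the function name `node_local_executors` and all pod and node names are hypothetical.

```python
from typing import Dict, List


def node_local_executors(
    block_datanode_pods: List[str],
    executor_pods: List[str],
    pod_to_node: Dict[str, str],
) -> List[str]:
    """Return executor pods that share a physical node with a datanode pod
    holding a replica of the block (hypothetical helper, for illustration)."""
    # Physical nodes that host a datanode pod with a replica of the block.
    replica_nodes = {
        pod_to_node[pod] for pod in block_datanode_pods if pod in pod_to_node
    }
    # An executor counts as node-local if its pod runs on one of those nodes.
    return [pod for pod in executor_pods if pod_to_node.get(pod) in replica_nodes]


# Hypothetical cluster layout: every name below is made up.
pod_to_node = {
    "datanode-0": "node-a",
    "datanode-1": "node-b",
    "spark-exec-1": "node-a",
    "spark-exec-2": "node-c",
}

local = node_local_executors(
    ["datanode-0", "datanode-1"],
    ["spark-exec-1", "spark-exec-2"],
    pod_to_node,
)
print(local)  # ['spark-exec-1']
```

The point of the sketch is that pod IPs or pod names alone do not reveal co-location. Locality only becomes visible once both the executor pod and the datanode pod are resolved to the physical node they run on.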

Kimoon Kim

Pepperdata

Kimoon Kim is a software engineer at Pepperdata. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive datasets.

Ilan Filonenko

Bloomberg LP

Ilan Filonenko is a four-time returning engineering intern at Bloomberg LP, where he has designed and architected distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical lead in various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan’s current research studies algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms such as stochastic gradient descent (SGD).