Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

What's the Hadoop-la about Kubernetes?

Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)
2:05pm–2:45pm Wednesday, 09/12/2018

Who is this presentation for?

  • Big data architects and IT managers

Prerequisite knowledge

  • Basic familiarity with containers and orchestration (Kubernetes, Docker, etc.)
  • A working knowledge of big data products, such as Hadoop and Spark

What you'll learn

  • Understand Kubernetes concepts and infrastructure features with respect to stateful workloads
  • Identify key features in big data products that need to be adapted to container management platforms
  • Learn the gaps and considerations for running big data workloads, specifically Hadoop, on Kubernetes


Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on the journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services, such as frontend UIs and simple, content-centric experiences, are often great candidates for stateless deployment since HTTP is stateless by nature; such workloads have no dependency on local container storage. Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop, Spark (to a lesser extent), NoSQL platforms such as Cassandra and MongoDB, and relational databases such as Postgres and MySQL are great examples. They all require some form of persistent storage that will survive service restarts.
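Kubernetes expresses this "storage that survives restarts" requirement with StatefulSets and PersistentVolumeClaims. As a minimal, hypothetical sketch (the image name, labels, and storage size below are illustrative, not from the talk), an HDFS DataNode might be deployed like this:

```yaml
# Hypothetical sketch: a StatefulSet gives each DataNode pod a stable
# identity (hdfs-datanode-0, hdfs-datanode-1, ...) and a per-pod
# PersistentVolumeClaim that is retained across pod restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode            # illustrative name
spec:
  serviceName: hdfs-datanode     # headless Service providing stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
      - name: datanode
        image: example/hdfs-datanode:3.1   # hypothetical image
        volumeMounts:
        - name: data
          mountPath: /hadoop/dfs/data      # DataNode block storage
  volumeClaimTemplates:          # one PVC per pod, not shared, not deleted on restart
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```

Unlike a Deployment, deleting and recreating a pod here reattaches the same volume, which is exactly the property stateful big data services depend on.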

Anant Chintamaneni and Nanda Vijaydev highlight the key gaps and considerations based on a real-world implementation of big data cluster orchestration on Kubernetes. Several attributes of stateful, multiservice big data applications need to be considered. Hadoop and Spark are not exactly monolithic applications, but they come close: multiple cooperating services with dynamic APIs. Service startup and teardown ordering requirements, with different sets of services running on different hosts (nodes), result in tricky service interdependencies that affect scalability. There is also a great deal of configuration (i.e., state), such as hostnames, IP addresses, ports, and service-specific settings, that must be maintained to run fault-tolerant clusters. Anant and Nanda detail the technical configurations and customizations required to run Hadoop distributions on Kubernetes and explore the gaps when comparing Hadoop on Kubernetes to the standard deployment of Hadoop on physical servers or virtual machines.
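One common way to handle the hostname/IP portion of that configuration state in Kubernetes is a headless Service, which gives each pod of a StatefulSet a stable DNS name even as pod IPs change. A hedged sketch, assuming a hypothetical StatefulSet named `hdfs-datanode` (the name and port are illustrative; 9866 is the default HDFS DataNode data-transfer port in Hadoop 3.x):

```yaml
# Hypothetical sketch: a headless Service (clusterIP: None) creates per-pod
# DNS records such as
#   hdfs-datanode-0.hdfs-datanode.default.svc.cluster.local
# so Hadoop services can address peers by stable hostname rather than IP.
apiVersion: v1
kind: Service
metadata:
  name: hdfs-datanode          # must match the StatefulSet's serviceName
spec:
  clusterIP: None              # headless: DNS per pod, no load balancing
  selector:
    app: hdfs-datanode
  ports:
  - name: data-transfer
    port: 9866                 # default DataNode port in Hadoop 3.x
```

Stable per-pod DNS matters because Hadoop configuration files typically pin services to specific hostnames; load-balanced virtual IPs would break that assumption.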

Topics include:

  • Full cluster lifecycle management
  • Big data application support (i.e., running applications without modification)
  • Management of storage and networking resources
  • Integration and conformance with existing enterprise services (e.g., LDAP/AD, SSO, TLS)
  • Multitenancy, multiple clusters with different versions, auditing, monitoring, etc.
  • Data locality and performance

Anant Chintamaneni


Anant Chintamaneni is vice president of products at BlueData, where he is responsible for product management and focuses on helping enterprises deploy big data technologies such as Hadoop and Spark. Anant has more than 15 years’ experience in business intelligence, advanced analytics, and big data infrastructure. Previously, Anant led the product management team for Pivotal’s big data suite.


Nanda Vijaydev


Nanda Vijaydev is the lead data scientist and head of solutions at BlueData (now HPE), where she leverages technologies like TensorFlow, H2O, and Spark to build solutions for enterprise machine learning and deep learning use cases. Nanda has more than 10 years of experience in data science and data management. Previously, she worked on data science projects in multiple industries as a principal solutions architect at Silicon Valley Data Science and served as director of solutions engineering at Karmasphere.