Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference
Singapore

Securing big data on YARN, Hive, and Spark clusters

Nitin Khandelwal (Qubole Inc), Abhishek Modi (Qubole)
12:05pm–12:45pm Wednesday, December 7, 2016
Security & governance
Location: 321/322
Level: Beginner
Average rating: 4.00 (2 ratings)

Prerequisite Knowledge

  • A basic understanding of Hadoop, YARN, and SSL

What you'll learn

  • Learn how YARN security features like SSL encryption and Kerberos-based authentication work
  • Explore the challenges in enabling these features for ephemeral clusters running in the cloud with multitenancy support

Description

Qubole’s Big Data Service launched three years ago with a hardened Hadoop-1 stack and later started offering YARN-based clusters for next-generation technologies like Spark and Tez, in addition to MapReduce. YARN is a big shift from the traditional Hadoop model and supports multitenant platforms. With multitenancy, and with bigger organizations moving to the cloud, security becomes a major concern. With YARN, security features such as SSL encryption, Kerberos-based authentication, and HDFS encryption were added.

Achieving the same level of reliability and performance as Qubole’s first-generation Hadoop offering while migrating scores of customers to these new security features was a big challenge. Qubole runs YARN-based services in the cloud with features like autoscaling, where nodes may be added and removed at runtime, which makes SSL-based communication between them challenging. In addition, services like Hive Server, Spark Notebooks, and Qubole-specific services need to communicate with YARN. Nitin Khandelwal and Abhishek Modi share the challenges they faced in enabling these features for ephemeral clusters running in the cloud with multitenancy support, as well as performance numbers for the different encryption algorithms available.

Topics include:

  • Adding support for SSL-based communication between ephemeral YARN nodes and different services
  • Framework for managing Kerberos-based authentication on multitenant services
  • HDFS transparent encryption with Key Management Server
  • Performance comparisons of 3DES, RC4, and unencrypted data transfer

Nitin Khandelwal

Qubole Inc

Nitin Khandelwal is a staff engineer at Qubole. He has worked on a range of projects, including encrypted communication for ephemeral cluster nodes running in the cloud, Hive as a multitenant service, and autoscaling. He has contributed significantly to optimizing the Tez engine for ETL workloads by adding features like workload-aware autoscaling, fault tolerance, and effective use of spot nodes.
Previously, Nitin worked at Microsoft on the site-to-site VPN gateway service, which forms the backbone of Microsoft Azure Stack’s network.

Nitin holds a master's degree in computer science from IIIT-Hyderabad, where his main areas of focus were distributed computing, databases, and networks.

Abhishek Modi

Qubole

Abhishek Modi works on the Hadoop and YARN stack at Qubole, where he has built key YARN features such as its autoscaling framework and the balancing of spot nodes in a cluster. Previously, he worked at Adobe Systems, where he filed multiple patents. Abhishek holds a degree from IIT-Varanasi.