Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Deploying and managing Hive, Spark, and Impala in the public cloud

David Tishgart (Cloudera), Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Jennifer Wu (Cloudera)
9:00–12:30 Tuesday, 23 May 2017
Level: Intermediate

Who is this presentation for?

  • Big data administrators, Hadoop administrators, data engineers, business intelligence end users, and cloud administrators

Prerequisite knowledge

  • A basic understanding of Hive, Spark, and Impala use cases, deployment workflows, and configuration
  • General knowledge of AWS EC2 and S3

Materials or downloads needed in advance

    Bring a computer if you want to participate in the hands-on exercises. Before the tutorial, please make sure you have the AWS CLI installed on your machine.

  • Please use a pip-based install (as opposed to an MSI or bundled installation), as you will need pip during the tutorial.
  • You will not need to have an AWS account of your own.
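
    One way to confirm the setup ahead of time is a small check that both tools are on your PATH. The sketch below is illustrative only and is not part of the tutorial materials: the function name find_missing_tools is hypothetical, and it assumes the executables are named "pip" and "aws" (some systems expose pip only as "pip3").

```python
import shutil

# Tools the session description asks attendees to have available.
# Assumption: the executables are named "pip" and "aws" on this machine.
REQUIRED_TOOLS = ("pip", "aws")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("pip and the AWS CLI are both on PATH.")
```

    If either name is reported as missing, install or repair it before the session.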

What you'll learn

  • Explore the factors to consider when deploying Hadoop in the public cloud
  • Learn the basics of deploying and configuring Hive, Spark, and Impala clusters in AWS
  • Understand how to deploy Hadoop clusters into Azure and Google Cloud Platform


Public cloud usage for Hadoop workloads is accelerating, and Hadoop components have adapted to leverage cloud infrastructure, including object storage and elastic compute. Hive, Spark, and Impala can read input from and write output to AWS S3 storage directly. Since data persisted in S3 lives beyond cluster lifecycles, users can now spin up Hadoop clusters for specific time periods or workloads, grow and shrink them as needed, and terminate them when they are no longer in use. Hadoop clusters in the public cloud can therefore be both transient and elastic.

Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala there. They walk you through using existing tools to create and configure Hive, Spark, and Impala deployments in the AWS environment, with considerations for network settings, AWS instance types, and security options. Eugene, Philip, David, and Jennifer also demonstrate how Hadoop clusters can be easily deployed into Azure and Google Cloud Platform. Once deployed, you’ll be able to grow and shrink clusters to accommodate your workloads.

David Tishgart


David Tishgart is director of cloud product marketing at Cloudera. Prior to joining Cloudera, David ran product and partner marketing programs at Gazzang, helping drive business demand for enterprise encryption and key management for big data. Prior to Gazzang, he was director of services marketing at Dell. David holds a bachelor of broadcast journalism degree from The University of Texas at Austin.

Philip Langdale


Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to work on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Eugene Fratkin


Eugene Fratkin is a director of engineering at Cloudera, heading Cloud R&D. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Jennifer Wu


Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer was a product line manager at VMware, where she worked on the vSphere and Photon system management platforms.