Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Deploying and managing Hive, Spark, and Impala in the public cloud

Andrei Savu (Cloudera), Vinithra Varadharajan (Cloudera), Jennifer Wu (Cloudera), Matthew Jacobs (Cloudera)
1:30pm–5:00pm Tuesday, 09/27/2016
Enterprise adoption
Location: 1B 01/02 Level: Intermediate
Tags: cloud

Prerequisite knowledge

  • A basic understanding of Hive, Spark, and Impala use cases, deployment workflows, and configuration
  • General knowledge of AWS EC2 and S3

Materials or downloads needed in advance

  • A laptop with an SSH client installed
  • An AWS account and credentials with access to EC2-VPC and S3

What you'll learn

  • Explore the factors to consider when deploying Hadoop in the public cloud
  • Learn the basics of deploying and configuring Hive, Spark, and Impala clusters in AWS
  • Understand how to deploy Hadoop clusters into Azure and Google Cloud Platform

Description

    Public cloud usage for Hadoop workloads is accelerating, and Hadoop components have adapted to take advantage of cloud infrastructure, including object storage and elastic compute. Hive, Spark, and Impala can read input from and write output directly to AWS S3. Because data persisted in S3 outlives any individual cluster, users can spin up Hadoop clusters for specific time periods or workloads, grow and shrink them as needed, and terminate them when they are no longer in use. Hadoop clusters in the public cloud can therefore be both transient and elastic.

    Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala there. They walk you through using existing tools to create and configure Hive, Spark, and Impala deployments in AWS, with considerations for network settings, AWS instance types, and security options. Andrei, Vinithra, Matthew, and Jennifer also demonstrate how Hadoop clusters can be deployed into Azure and Google Cloud Platform. Once your clusters are deployed, you’ll be able to grow and shrink them to accommodate your workloads.
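
The transient-cluster pattern in the description (read input from S3, write output to S3, then terminate the cluster) can be illustrated with a minimal PySpark sketch. This is not taken from the tutorial materials. The bucket name, the paths, and the `team` column are hypothetical placeholders, and the sketch assumes a cluster where PySpark and Hadoop's S3A connector (the `s3a://` scheme) are already configured with valid AWS credentials.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI, the scheme used by Hadoop's S3A connector."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def run_transient_job(input_uri: str, output_uri: str) -> None:
    """Read Parquet data from S3, aggregate it, and write the result to S3.

    Both input and output live in S3, so nothing of value remains on the
    cluster once the job finishes and the cluster can be terminated.
    """
    # Imported lazily so s3a_uri() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transient-s3-job").getOrCreate()
    try:
        df = spark.read.parquet(input_uri)
        # Hypothetical aggregation: count rows per value of a "team" column.
        df.groupBy("team").count().write.mode("overwrite").parquet(output_uri)
    finally:
        spark.stop()


# Hypothetical bucket and paths; replace with your own before running a job.
INPUT_URI = s3a_uri("example-bucket", "/raw/stats")
OUTPUT_URI = s3a_uri("example-bucket", "/reports/team-counts")
```

On a running cluster, calling `run_transient_job(INPUT_URI, OUTPUT_URI)` would execute the job. Because the results land in S3 rather than in cluster-local HDFS, the cluster can be shut down immediately afterward without losing data.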

    Andrei Savu

    Cloudera

    Andrei Savu is a software engineer at Cloudera, where he’s working on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

    Vinithra Varadharajan

    Cloudera

    Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she’s responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

    Jennifer Wu

    Cloudera

    Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

    Matthew Jacobs

    Cloudera

    Matthew Jacobs is a software engineer at Cloudera working on Impala.

    Comments

    09/27/2016 12:21pm EDT

    Will the slides be available after the talk?

    Vinithra Varadharajan
    09/27/2016 11:42am EDT

    GitHub repo with instructions on how to recreate this demo: https://github.com/cloudera/strata-tutorial-2016-nyc/

    GitHub repo with sample config files for Cloudera Director and scripts to prebuild an AMI for faster bootstrap time: github.com/cloudera/director-scripts

    Blog post that explains the use case for the Spark jobs: http://blog.cloudera.com/blog/2016/06/how-to-analyze-fantasy-sports-using-apache-spark-and-sql/

    Vinithra Varadharajan
    09/27/2016 9:27am EDT

    Hi Kulsoom,

    If you are the admin of an AWS account, as would be the case since you are opening up a new one, then you should have more than sufficient permissions. Specifically, you’ll need permissions to be able to create an IAM role.

    09/27/2016 8:47am EDT

    Hi – I am looking forward to the tutorial. This is a last-minute question, as my AWS account might not pan out.
    I am purchasing an AWS account now. Is there anything additional I need to do, or will this be sufficient? I see I can launch an EC2 instance and that I could SSH into it…
    Thanks for the help at the last minute.

    Andrei Savu
    09/26/2016 11:56am EDT

    In this tutorial, we will use the AWS Quick Start as the starting point. To successfully execute that CloudFormation template, you will need an AWS account with a broad set of permissions, including the ability to create new IAM users.

    https://aws.amazon.com/quickstart/
    http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html