Presented by O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Deploying and operating big data analytic apps on the public cloud

Jennifer Wu (Cloudera), Eugene Fratkin (Cloudera), Andrei Savu (Cloudera), Tony Wu (Cloudera)
9:00am–12:30pm Tuesday, March 14, 2017
Big data and the Cloud
Location: LL21 A
Level: Intermediate
Secondary topics: Architecture, Cloud
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Architects and admins

Prerequisite knowledge

  • A basic understanding of Hive, Spark, and Impala use cases, deployment workflows, and configuration
  • General knowledge of AWS EC2 and S3

Materials or downloads needed in advance

  • A laptop with an SSH client installed
  • An AWS account and credentials with access to EC2-VPC and S3

What you'll learn

  • Understand the factors to consider when deploying Hadoop in the public cloud
  • Explore the basics of deploying and configuring Hive, Spark, and Impala clusters in AWS
  • Learn how to deploy Hadoop clusters into Azure and Google Cloud Platform

Description

Public cloud usage for Hadoop workloads is accelerating, and Hadoop components have adapted to leverage cloud infrastructure, including object storage and elastic compute. Hive, Spark, and Impala can read input from and write output directly to AWS S3 storage. Because data persisted in S3 lives beyond any single cluster's lifecycle, users can spin up a Hadoop cluster for a specific time period or workload, grow and shrink it as needed, and terminate it when it is no longer in use. Hadoop clusters in the public cloud can therefore be both transient and elastic.
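
As a concrete illustration of this direct-to-S3 pattern, here is a minimal PySpark sketch. It assumes a cluster whose s3a connector and AWS credentials are already configured; the bucket, paths, and column name are hypothetical placeholders, not materials from the tutorial.

    # Minimal sketch: read from and write to S3 directly, so the data
    # outlives this (transient) cluster. Bucket and paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-transient-demo").getOrCreate()

    # Input lives in S3, not on cluster-local HDFS.
    events = spark.read.parquet("s3a://example-bucket/raw/events/")

    # A simple aggregation standing in for a real workload.
    daily_counts = events.groupBy("event_date").count()

    # Results go back to S3, where a future cluster can pick them up;
    # this cluster can then be shrunk or terminated.
    daily_counts.write.mode("overwrite").parquet(
        "s3a://example-bucket/curated/daily_counts/")

    spark.stop()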

Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala there. They walk you through using existing tools to create and configure Hive, Spark, and Impala deployments in the AWS environment, with attention to network settings, AWS instance types, and security options, and they demonstrate that Hadoop clusters can just as easily be deployed into Azure and Google Cloud Platform. Once deployed, you’ll be able to grow and shrink clusters to accommodate your workloads.
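
The session uses Cloudera's own deployment tooling; purely as a hedged sketch of the network, instance-type, and security decisions involved, here is what provisioning cluster nodes into an EC2-VPC looks like with boto3. Every ID, name, and size below is a hypothetical placeholder, not a value from the tutorial.

    # Hedged sketch (not the presenters' tool): launch three worker nodes,
    # making the network, instance-type, and security choices explicit.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",            # hypothetical base image
        InstanceType="m4.xlarge",                   # instance type sized for the workload
        MinCount=3,
        MaxCount=3,
        SubnetId="subnet-0123456789abcdef0",        # network: launch into a VPC subnet
        SecurityGroupIds=["sg-0123456789abcdef0"],  # security: restrict cluster traffic
        KeyName="my-keypair",                       # SSH access for administration
    )
    print([i["InstanceId"] for i in response["Instances"]])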

Jennifer Wu

Cloudera

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Previously, she was a product line manager at VMware, where she worked on the vSphere and Photon system management platforms.

Eugene Fratkin

Cloudera

Eugene Fratkin is a director of engineering at Cloudera, leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene cofounded a Sequoia Capital-backed company focused on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Andrei Savu

Cloudera

Andrei Savu is a software engineer at Cloudera, where he works on Cloudera Director, a product that makes Hadoop deployments in cloud environments easier and more reliable for customers.

Tony Wu

Cloudera

Tony Wu leads the Partner Enablement Cloud Hardware Infrastructure and Platform (CHIP) team at Cloudera, which is responsible for Microsoft Azure integration for Cloudera Director. Tony focuses on integrating partner solutions (cloud and hardware) with Cloudera software. He is also part of the team responsible for the EMC DSSD integration with Cloudera’s Distribution of Hadoop (CDH) and Cloudera Manager (CM).
