Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

A deep dive into running big data workloads in the cloud

Vinithra Varadharajan (Cloudera), Philip Langdale (Cloudera), Jason Wang (Cloudera), Fahd Siddiqui (Cloudera)
9:00am–12:30pm Tuesday, December 5, 2017
Big data and the cloud
Location: 308/309 Level: Intermediate

Who is this presentation for?

  • Data engineers, ETL developers, big data architects, and IT administrators

Prerequisite knowledge

  • A basic understanding of AWS and big data compute engines

Materials or downloads needed in advance

  • A laptop

What you'll learn

  • Learn how to successfully run a data pipeline in the cloud and integrate data engineering and data analytic workflows
  • Understand the considerations and best practices for data pipelines in the cloud

Description

Public cloud usage for large-scale data processing is rapidly increasing, and running data engineering workloads in the cloud is becoming easier and more cost-effective. Compute engines have adapted to take advantage of cloud infrastructure, including object storage and elastic compute. For example, Hive, Spark, Impala, and HBase can read input from and write output directly to AWS S3 and Azure Data Lake storage. Moreover, these read and write paths have been optimized for fast processing, which lowers the overall cost of running a job. In addition, platform-as-a-service offerings for data processing in the cloud have evolved to minimize the operational overhead of clusters, so end users can focus on developing, running, and troubleshooting jobs.

End users need to implement data pipeline workflows that transition seamlessly from one stage of the pipeline to the next. Data engineering, a workload that transforms raw data at scale into clean, structured data, runs before most analytic and operational database use cases. Vinithra Varadharajan, Philip Langdale, Jason Wang, and Fahd Siddiqui lead a deep dive into running data engineering workloads as a managed service in the public cloud, highlighting cloud infrastructure best practices and illustrating how data engineering workloads interoperate with data analytic engines.

Vinithra Varadharajan

Cloudera

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Philip Langdale

Cloudera

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to work on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Jason Wang

Cloudera

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Fahd Siddiqui

Cloudera

Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.
