Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

A deep dive into running data engineering workloads in AWS

Jennifer Wu (Cloudera), Fahd Siddiqui (Cloudera), Paul George (Cloudera), Eugene Fratkin (Cloudera)
9:00am–12:30pm Tuesday, September 26, 2017
Big data and the Cloud, Data Engineering & Architecture
Location: 1E 10 Level: Intermediate
Secondary topics: Architecture, Cloud
Average rating: 1.50 (2 ratings)

Who is this presentation for?

  • Data engineers, ETL developers, and Hadoop administrators

Prerequisite knowledge

  • A basic understanding of AWS concepts and managing Hadoop components such as Hive and Spark

Materials or downloads needed in advance

  • A laptop with an SSH client installed
  • An AWS account with AWS IAM admin access

What you'll learn

  • Learn how to successfully run a data engineering workload in AWS and how to integrate data engineering and data analytic workflows
  • Understand considerations and best practices for data engineers in AWS


Public cloud usage for large-scale data processing is rapidly increasing, and running data engineering workloads in the cloud is becoming easier and more cost-effective. Compute engines have adapted to leverage cloud infrastructure, including object storage and elastic compute. For example, the Hive, Spark, and Impala compute engines are able to read input from and write output directly to AWS S3 storage. Moreover, these read and write paths have been optimized for fast processing speeds, lowering the overall cost of running a job. In addition, platform-as-a-service offerings for data processing in the cloud have evolved to minimize the operational overhead of clusters, instead allowing the end user to focus on workloads: developing, running, and troubleshooting jobs.
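As an illustration of the direct S3 read/write path described above, here is a minimal, hypothetical configuration sketch for Spark using the Hadoop s3a connector; the bucket name and credential values are placeholders, and in practice credentials are typically supplied via IAM instance roles rather than literal keys.

```
# spark-defaults.conf -- illustrative fragment only
# Placeholder credentials; prefer IAM instance roles in production.
spark.hadoop.fs.s3a.access.key   <AWS_ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key   <AWS_SECRET_ACCESS_KEY>

# With the connector configured, a job can read from and write to
# S3 directly by using s3a:// paths, e.g. in spark-shell:
#   val df = spark.read.json("s3a://example-bucket/raw/events/")
#   df.write.parquet("s3a://example-bucket/clean/events/")
```

Hive and Impala take the same approach by pointing a table's LOCATION at an S3 path, so intermediate and final datasets never need to pass through HDFS.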

Data engineering, a workload that transforms raw data at scale into clean structured data, is a foundational workload run prior to most analytic and operational database use cases. It’s important for end users to be able to implement data pipeline workflows that seamlessly transition from one stage of the data pipeline to the next. Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.


Jennifer Wu


Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer was a product line manager at VMware, where she worked on the vSphere and Photon system management platforms.


Fahd Siddiqui


Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.


Paul George


Paul George is a software engineer at Cloudera, working on cloud products such as Cloudera Altus. Previously, Paul worked at Palantir Technologies and cofounded a company focused on building data systems for genomics. He holds a PhD in electrical and computer engineering from Cornell University.


Eugene Fratkin


Eugene Fratkin is a director of engineering at Cloudera, heading Cloud R&D. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University's AI lab.