Sep 23–26, 2019

Running multidisciplinary big data workloads in the cloud

Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
9:00am–12:30pm Tuesday, September 24, 2019
Location: 1E 14
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage

Who is this presentation for?

Data engineers, data scientists, BI engineers, analytic engineers, and those in IT



Prerequisite knowledge

- Familiarity with public cloud concepts
- Basic understanding of big data workloads (data engineering, data warehousing)

Materials or downloads needed in advance

A WiFi-enabled laptop. If you want to use the CLI, you also need Python 3.6 installed and terminal access.

What you'll learn

- Learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows
- Understand considerations and best practices for data analytics pipelines in the cloud
- Explore approaches for sharing metadata across workloads in a big data PaaS


Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature.

Moving these workloads to the cloud poses many challenges. The talk starts with a technical deep dive into cloud architecture and the challenges of moving to the cloud. Topics include:
- Things to keep in mind when moving to the cloud and why it may not be as simple as you thought (e.g., data migration and duplication between on-prem and the cloud)
- Core cloud paradigms not present on-prem that drive architecture decisions (e.g., bursting, different cluster lifecycles, and tenancy)
- Security best practices in the cloud (e.g., the basics, common pitfalls, and often-overlooked details that you need to get right)
- How to manage metadata between various workloads across multiple clusters, both on-prem and in the cloud

In the second part of the talk you’ll get your hands dirty and learn how to successfully set up and run a data pipeline in the cloud that integrates with data engineering and data warehousing workflows. We’ll explore considerations and best practices in getting data pipelines running. Along the way you’ll also see how to share metadata across workloads in a big data architecture. For this second part of the talk you’ll use the Cloudera Altus PaaS offering, powered by Cloudera Altus SDX, to run various big data workloads.

Jason Wang


Jason Wang is a software engineer at Cloudera focusing on the cloud.

Tony Wu


Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead for the partner engineering team at Cloudera. He is responsible for Microsoft Azure integration for Cloudera Director.

Vinithra Varadharajan


Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

