Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Running multidisciplinary big data workloads in the cloud

Colm Moynihan (Cloudera), Jonathan Seidman (Cloudera), Michael Kohs (Cloudera)
13:30–17:00 Tuesday, 30 April 2019
Data Engineering and Architecture
Location: Capital Suite 4
Average rating: 4.00 (2 ratings)

Who is this presentation for?

  • Data engineers, data scientists, BI engineers, analytic engineers, and those in IT

Level

Intermediate

Prerequisite knowledge

  • Familiarity with public cloud concepts
  • A basic understanding of big data workloads (data engineering, data warehousing, etc.)

Materials or downloads needed in advance

  • A WiFi-enabled laptop (to use the CLI, you also need Python 3.6 installed and terminal access)

What you'll learn

  • Learn how to run a data analytics pipeline in the cloud and integrate data engineering and data analytics workflows
  • Understand considerations and best practices for data analytics pipelines in the cloud
  • Explore approaches for sharing metadata across workloads in a big data PaaS

Description

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature.

Colm Moynihan, Jonathan Seidman, and Michael Kohs offer a technical deep dive into cloud architecture and explore the challenges of moving to the cloud. You’ll learn what to keep in mind when migrating and why it may not be as simple as you thought (e.g., data migration and duplication between on-premises and cloud environments). You’ll also dive into core cloud paradigms that have no on-premises equivalent and that drive architecture decisions (e.g., bursting, different cluster lifecycles, and tenancy), as well as security best practices in the cloud (e.g., the basics, common pitfalls, and often-overlooked details that you need to get right). Along the way, you’ll learn how to manage metadata between various workloads across multiple clusters, both on-premises and in the cloud.

In the second part of the talk, you’ll get your hands dirty as you learn how to set up and run a data pipeline in the cloud that integrates with data engineering and data warehousing workflows, using the Cloudera Altus PaaS offering, powered by Cloudera Altus SDX. You’ll discover considerations and best practices for getting data pipelines running. You’ll also see how to share metadata across workloads in a big data architecture.

Colm Moynihan

Cloudera

Colm Moynihan is partner presales manager in EMEA for Cloudera, where he helps system integrators, ISVs, hardware and cloud partners, resellers, and distributors drive digital transformation for joint customers. Previously, Colm was director of presales in EMEA at Informatica, working with resellers, OEMs, and GSIs to integrate, master, and cleanse customers’ enterprise data. Colm has over 25 years’ experience in development, consulting, finance and banking, startups, and large multinational software companies. Colm holds a master’s degree in distributed computing from Trinity College Dublin.

Jonathan Seidman

Cloudera

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Michael Kohs

Cloudera

Michael Kohs is a product manager at Cloudera.