Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Running Data Analytic Workloads on the Cloud

Mala Ramakrishnan (Cloudera), Eugene Fratkin (Cloudera), Mark Samson (Cloudera)
9:00–12:30 Tuesday, 22 May 2018
Data engineering and architecture
Location: Capital Suite 13
Level: Intermediate

Who is this presentation for?

Data engineers/developers, data scientists, system architects, system administrators, information security professionals

Prerequisite knowledge

A good understanding of the basics of data warehousing

Materials or downloads needed in advance

A laptop with Wi-Fi. To use the CLI (command line), which is optional, attendees should have Python 2.7 or above installed and be able to install packages using pip.
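The optional CLI prerequisites above can be checked before the session with a short script. This is a minimal sketch, not something supplied by the presenters: the helper names `python_ok` and `pip_available` are hypothetical, and the script only confirms that the interpreter is at least version 2.7 and that the `pip` package is importable.

```python
import importlib
import sys

# Minimum version stated in the prerequisites; the helpers below are
# hypothetical illustrations, not part of any session material.
MIN_VERSION = (2, 7)


def python_ok(version_info=None):
    """Return True if the given (or running) interpreter is at least MIN_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


def pip_available():
    """Return True if the pip package can be imported by this interpreter."""
    try:
        importlib.import_module("pip")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    print("Python %s: %s" % (version, "ok" if python_ok() else "too old"))
    print("pip: %s" % ("available" if pip_available() else "missing"))
```

Run it with the interpreter you plan to use for the CLI. If pip is reported missing, install it with your platform's usual method before the tutorial.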

What you'll learn

Data pipeline creation and management in cloud and hybrid cloud environments
Metadata sharing and discovery across data applications


Over the past several years, we have observed ever-increasing quantities of data being processed within public clouds. The cloud promises to address some of the limitations of conventional single, multipurpose clusters by offering hyperscale storage that is decoupled from elastic, on-demand compute and by allowing data to be shared between processing engines provisioned on demand, such as Hive, Spark, and Impala.

But to fulfill this promise, one needs to solve several technical challenges: simple resource allocation, cross-cluster metadata sharing, and a common authorization framework. Without comprehensive answers to these challenges, the problems of the single-cluster model are simply duplicated inside a public cloud environment.

During our tutorial, we’ll discuss some of the new paradigms that allow one to run production-level pipelines effectively with minimal operational overhead and that remove the barriers to data discovery, metadata sharing, and access control. As part of the deep dive, we’ll provide a hands-on example of creating such pipelines and executing data processing and data analytics workflows.

Mala Ramakrishnan


Mala Ramakrishnan heads product initiatives for data engineering at Cloudera. She has 17+ years of experience in marketing, product management, and software development in organizations of varied sizes that deliver middleware, software security, network optimization, and mobile computing. She holds a master’s degree in computer science from Stanford University. Outside of work, she is a mom of two boys, ages 6 and 9.

Eugene Fratkin


Eugene Fratkin is a director of engineering at Cloudera, where he leads cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Mark Samson


Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mark has 17 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.
