Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Understanding data at scale leveraging Spark and Deep Learning Frameworks.

Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
9:00–12:30 Tuesday, 22 May 2018
Data science and machine learning
Location: Capital Suite 12
Level: Intermediate

Who is this presentation for?

Data Analysts/Engineers/Architects

Prerequisite knowledge

A basic understanding of data pipelines, Spark, and machine learning in general, plus a basic working knowledge of Scala/Python.

Materials or downloads needed in advance

Attendees will be provided with a cloud-based environment in which to run the sample data sets. Attendees will also have the option of downloading a VM or using a local computer to run the programs, though this is not ideal.

What you'll learn

Attendees will understand the preprocessing and ingestion techniques and tools suited to different kinds of data sets, specifically audio, video/images, and text. They will also walk away with an understanding of the nuances of deploying training and inference at scale across data sets and frameworks.


The increasing complexity of learning algorithms and deep neural networks, combined with the size of the data and the number of parameters, has made it challenging to exploit existing large-scale data processing pipelines for training and inference.

In this talk, we walk through the use of different tools and frameworks, ranging from Spark for preprocessing to deep learning frameworks for training and inference. We focus on the nuances of the data sets in terms of preprocessing, training, and inference as they relate to algorithm and optimization techniques, frameworks, and scale.

Across our work in the field, we encounter various kinds of production pipelines.
Leveraging a typical “Big Data Production Pipeline” for learning and inference presents both challenges and opportunities, especially in terms of mapping data sets to the optimal algorithm and/or architecture.

We guide you through the source code (on sample data sets) for ingestion, preprocessing, training, inference, and deployment across data sets, as employed in production at scale.

Vartika Singh


Vartika Singh is a solutions consultant at Cloudera. Previously, Vartika was a data scientist applying machine-learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 10 years of experience designing and developing solutions and frameworks utilizing machine-learning techniques.

Jeffrey Shmain


Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.
