Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Leveraging Spark and deep learning frameworks to understand data at scale

Vartika Singh (Cloudera), Juan Yu (Cloudera), Marton Balassi (Cloudera), Steven Totman (Cloudera)
9:00–12:30 Tuesday, 22 May 2018
Data science and machine learning
Location: Capital Suite 15 Level: Intermediate
Average rating: 3.75 (4 ratings)

Who is this presentation for?

  • Data analysts, software engineers, and data scientists

Prerequisite knowledge

  • A basic understanding of data pipelines, Spark, and machine learning
  • A working knowledge of Scala and Python

Materials or downloads needed in advance


What you'll learn

  • Learn preprocessing and ingestion techniques and tools ideal for different kinds of datasets
  • Understand the nuances of deployment at scale for training and inference across data sets and frameworks


The increasing complexity of learning algorithms and deep neural networks, combined with the size of the data and the number of parameters, has made it challenging to exploit existing large-scale data processing pipelines for training and inference. Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. They walk you through different tools and frameworks, ranging from Spark for preprocessing to deep learning frameworks for training and inference, focusing on how the nuances of each dataset relate to algorithm optimization techniques, frameworks, and scale.

Vartika Singh


Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Juan Yu


Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Marton Balassi


Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.

Steven Totman


Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s InfoSphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data integration and governance/metadata-related designs.

Comments on this page are now closed.


Juan Yu
5/06/2018 3:57 BST

Hey Stephanie,

Sorry for the slow response. We have had some issues uploading the slides. I put my part on GitHub; you can get it here:

stephanie werli | DATA ENGINEER
23/05/2018 10:33 BST

Hello, I was really interested in Juan Yu’s part. Is there a way to get the slides of her presentation? Thanks.