Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Leveraging Spark and deep learning frameworks to understand data at scale

Vartika Singh (Cloudera), Juan Yu (Cloudera)
9:00–12:30 Tuesday, 22 May 2018
Data science and machine learning
Location: Capital Suite 15
Level: Intermediate

Who is this presentation for?

  • Data analysts, software engineers, and data scientists

Prerequisite knowledge

  • A basic understanding of data pipelines, Spark, and machine learning
  • A working knowledge of Scala and Python

Materials or downloads needed in advance

  • A laptop (You'll be provided with a cloud-based environment for running the programs against the sample datasets; you'll also have the option of downloading a VM or using a local computer, although this option is not ideal.)

What you'll learn

  • Learn preprocessing and ingestion techniques and tools ideal for different kinds of datasets
  • Understand the nuances of deployment at scale for training and inference across datasets and frameworks


The increasing complexity of learning algorithms and deep neural networks, combined with the size of data and parameters, has made it challenging to exploit existing large-scale data processing pipelines for training and inference. Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. Vartika and Juan walk you through different tools and frameworks, ranging from Spark for preprocessing to deep learning frameworks for training and inference, targeting the nuances in the datasets as they relate to algorithm optimization techniques, frameworks, and scale.

Vartika Singh


Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Juan Yu


Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.
