Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Data operations problems created by deep learning and how to fix them (sponsored by MapR)

Jim Scott (NVIDIA)
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 01/02

What you'll learn

  • Understand the major impediments to successful completion of deep learning projects and how to solve them


The exponential growth in compute available for deep learning has made it possible to create and test hundreds or thousands more models than was possible in the past. These models consume and generate data in both batch and real-time settings, for training as well as scoring. As data is enriched and model parameters are explored, everything needs to be versioned, including the data.

Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of projects in this space, along with their solutions, while walking you through a customer use case. The customer started with two input files for their core research area. This quickly grew to over 75 input files in nine different data formats, including CSV, HDF5, and PKL. Certain data formats caused a variety of problems, and iterating on the models and data created a data versioning problem. The total number of models and parameter sets grew rapidly, and when combined with the data versioning issues, frustrations escalated. As model creation and management advanced, limitations emerged in notebook applications like Jupyter and in the workflow management needed to keep track of an execution pipeline. The total volume of log output grew quickly, and significant volumes of data were being moved: source data to the GPU, log data back to storage, and then log data to the machines running the distributed compute for postmodel analytics that evaluate the performance characteristics of the models.

The problems expanded when preparing models for production deployment and adapting them for real-time use, not just training and testing. Orchestration of the systems was a big problem in the early stages, and it became clear that more thought was required to accommodate further model development. Jim concludes with a follow-on covering later model deployment and scoring with a canary model and a decoy model, leveraging the rendezvous architecture.

This session is sponsored by MapR.

Jim Scott


Jim Scott is the head of developer relations, data science, at NVIDIA. He’s passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).