The exponential growth of compute in deep learning has made it possible to create and test hundreds or even thousands more models than in the past. These models consume and generate data in both batch and real-time settings, for training as well as scoring use cases. As data is enriched and model parameters are explored, there is a real need to version everything, including the data.
Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successfully completing projects in this space, along with solutions, while walking you through a customer use case. The customer started with two input files for their core research area; this quickly grew to over 75 input files in nine different formats, including CSV, HDF5, and PKL, among others. Certain formats caused a variety of problems, and iterating on both the models and the data created a data versioning problem. The total number of models and parameter sets grew rapidly, and combined with the versioning issues, frustrations escalated. As model creation and management advanced, limitations arose around notebook applications like Jupyter and around the workflow management needed to track an execution pipeline. Log output grew quickly, and significant volumes of data were in motion: source data moving to the GPUs, log data back to storage, and log data out again to the machines handling the distributed compute for postmodel analytics that evaluated the models' performance characteristics.
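One common answer to the data-versioning problem described above is to content-address each input file, so that every model run can be traced back to the exact bytes it consumed. A minimal sketch, assuming datasets are versioned by a SHA-256 digest of their contents (the file names and manifest layout here are illustrative, not the customer's actual setup):

```python
import hashlib
import json
from pathlib import Path

def dataset_version(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a content-based version ID (SHA-256 hex digest) for one input file.

    Identical bytes always produce the identical version string, regardless
    of file name, format (CSV, HDF5, PKL, ...), or modification time.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir: Path) -> dict:
    """Map every input file in a directory to its content hash."""
    return {p.name: dataset_version(p)
            for p in sorted(data_dir.glob("*")) if p.is_file()}

if __name__ == "__main__":
    # Illustrative usage: write the manifest alongside a training run so any
    # model can be tied to the exact data version it was trained on.
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "a.csv").write_text("x,y\n1,2\n")
        print(json.dumps(build_manifest(root), indent=2))
```

Storing such a manifest with each training run makes "which data produced this model?" answerable even as input files multiply.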
The problems expanded when preparing models for production deployment and adapting them for real-time use rather than just training and testing. System orchestration was a major problem in the early stages, and it became clear that more thought was needed to accommodate further model development. Jim concludes with a follow-on look at later model deployment and scoring, using canary and decoy models in a rendezvous architecture.
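In the rendezvous architecture (described by Ted Dunning and Ellen Friedman), every scoring request is sent to all deployed models in parallel: a decoy model simply archives the raw inputs for later replay, a canary provides a stable baseline to compare against, and the caller receives the primary model's answer. A minimal single-process sketch, with illustrative class and function names that are not from the talk itself:

```python
from typing import Any, Callable, List, Tuple

class DecoyModel:
    """Archives every request verbatim; its output is never returned to callers.

    The archive enables replaying exactly what production saw when debugging
    or training a successor model.
    """
    def __init__(self) -> None:
        self.archive: List[Any] = []

    def score(self, features: Any) -> None:
        self.archive.append(features)

def rendezvous(features: Any,
               primary: Callable[[Any], Any],
               canary: Callable[[Any], Any],
               decoy: DecoyModel,
               comparisons: List[Tuple[Any, Any]]) -> Any:
    """Fan one request out to every model; return only the primary's answer.

    Primary-vs-canary pairs are recorded so drift between a new model and the
    long-running baseline can be evaluated offline. A production version would
    run the models concurrently under a latency budget.
    """
    decoy.score(features)           # raw input capture
    baseline = canary(features)     # stable reference prediction
    answer = primary(features)      # the model actually serving traffic
    comparisons.append((answer, baseline))
    return answer
```

The decoy and canary add little serving cost but make model swaps auditable: a new primary can be promoted only after its recorded disagreement with the canary is understood.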
This session is sponsored by MapR.
Jim Scott is the head of developer relations, data science, at NVIDIA. He’s passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com