Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.
The official Jupyter Conference
August 22-23, 2017: Training
August 23-25, 2017: Tutorials & Conference
New York, NY

The Containerized Jupyter Platform

Moderated by: Joshua Cook

Who is this presentation for?

Data scientists, machine learning engineers, and DevOps engineers

Prerequisite knowledge

Python

What you'll learn

Docker, docker-compose, and distributed Python

Description

It is not uncommon for a real-world data set to resist easy handling: it may not fit into available memory, or it may require prohibitively long processing. As a solution to this problem, this session presents Docker, an "infrastructure as code" technology, as a way to define a system for performing standard but non-trivial data tasks on medium- to large-scale datasets, with Jupyter as the master controller.
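
As a minimal illustration of the idea (assuming only that Docker is installed; this particular command is not drawn from the session materials), the prebuilt jupyter/scipy-notebook image launches a fully provisioned scientific Python notebook server in one step:

    # map the notebook port to the host and mount the current
    # directory into the container's working folder
    docker run -it --rm \
        -p 8888:8888 \
        -v "$PWD":/home/jovyan/work \
        jupyter/scipy-notebook

On startup the server prints a tokenized URL; opening it in a browser gives a working Jupyter environment with nothing installed on the host beyond Docker itself.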

We explore using existing prebuilt public images maintained by major open-source projects (Python, Jupyter, Postgres), as well as using a Dockerfile to extend these images to suit our specific purposes. We examine docker-compose and how it can be used to build a linked, multi-container system: Python workers churning data behind the scenes while Jupyter manages these background tasks. We also explore best practices for using existing libraries, and for developing our own, to deploy state-of-the-art machine learning and optimization algorithms.
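
As a sketch of what such a linked system might look like in a docker-compose.yml (the service names, the worker build context, and the demo credential below are illustrative assumptions, not the session's actual configuration):

    version: '3'
    services:
      jupyter:                        # master controller
        image: jupyter/scipy-notebook
        ports:
          - "8888:8888"
        volumes:
          - .:/home/jovyan/work
      postgres:                       # shared data store
        image: postgres:9.6
        environment:
          POSTGRES_PASSWORD: example  # demo-only credential
      worker:                         # background Python worker
        build: ./worker               # Dockerfile extending a Python base image
        depends_on:
          - postgres

A single docker-compose up starts all three containers on a shared network, where each service can reach the others by service name (the worker, for example, connects to the database at host postgres).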

Finally, we present two use cases for the technologies and methods outlined. First, we explore a multi-service system for developing machine learning pipelines using scikit-learn. Second, we examine best practices in using Docker and Jupyter to build and run neural networks on AWS GPU instances, using Keras with a TensorFlow backend. Throughout these case studies, we consider how the average data science practitioner would perform the requisite tasks in advanced numerical computing: developing locally, then deploying to the cloud for final model development and tuning.
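
As a hedged sketch of the second use case's starting point (assuming an AWS GPU instance such as a p2.xlarge with NVIDIA drivers and the nvidia-docker runtime installed; the image tag is illustrative):

    # run the GPU build of the public TensorFlow image, which in its
    # standard configuration serves a Jupyter notebook on port 8888
    nvidia-docker run -it --rm \
        -p 8888:8888 \
        tensorflow/tensorflow:latest-gpu

Keras can then be layered on with a short Dockerfile (FROM tensorflow/tensorflow:latest-gpu, then RUN pip install keras). Because the same image also runs unmodified on a CPU-only laptop via plain docker run, this supports the develop-locally, deploy-to-cloud workflow described above.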