It is not uncommon for a real-world data set to resist easy handling: it may not fit in available memory, or it may require prohibitively long processing. As a solution to this problem, this session presents using the "infrastructure as code" technology Docker to define a system for performing standard but nontrivial data tasks on medium- to large-scale datasets, with Jupyter as the master controller.
We explore using existing precompiled public images published by the major open-source projects – Python, Jupyter, Postgres – as well as using a Dockerfile to extend these images to suit our specific purposes. We examine docker-compose and how it can be used to build a linked system, with Python workers churning data behind the scenes and Jupyter managing these background tasks. We also explore best practices for using existing libraries, as well as for developing our own libraries to deploy state-of-the-art machine learning and optimization algorithms.
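A linked system of this shape can be sketched in a minimal docker-compose.yml; the image tags, service names, and build path below are illustrative assumptions, not the session's exact configuration:

```yaml
version: "3"
services:
  jupyter:                        # master controller: notebooks issue tasks
    image: jupyter/scipy-notebook # precompiled public image
    ports:
      - "8888:8888"
    depends_on:
      - postgres
  postgres:                       # shared data store
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: example
  worker:                         # Python workers churning data behind the scenes
    build: ./worker               # Dockerfile extending a public python image
    depends_on:
      - postgres
```

Running `docker-compose up` then brings the whole system online with one command.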
Finally, we present two use cases for the technologies and methods outlined. First, we explore a multi-service system for developing machine learning pipelines with scikit-learn. Second, we explore best practices for using Docker and Jupyter to build and run neural networks on AWS GPU instances using keras with a tensorflow backend.
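The first case study centers on scikit-learn pipelines of roughly the following shape; the synthetic data and the particular scaler/estimator choice are assumptions for illustration, not the session's actual models:

```python
# Minimal sketch: a scikit-learn pipeline of the kind a Jupyter notebook
# could fit and then hand off to a background worker as a single object.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a dataset that would otherwise be read from Postgres.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chaining preprocessing and estimation keeps the whole workflow in one
# object that can be pickled and shipped between services.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
```

Because the fitted pipeline is a single serializable object, the notebook process and the worker containers need only agree on library versions, which Docker guarantees.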
Throughout these case studies, we consider how the average data science practitioner would perform the requisite tasks in advanced numerical computing: developing locally, then deploying to the cloud for final model development and tuning.
©2017, O'Reilly Media, Inc.