Data Science Toolbox and the Importance of Reproducible Research

Jeroen Janssens (Data Science Workshops)
Data Science
Location: 113

Data scientists love to learn new technologies, train interesting statistical models, and create exciting data visualizations. Unfortunately, setting up a workable environment by installing all the required software and its dependencies is often not straightforward. These installation issues hamper the (1) learning experience of students, (2) productivity of data scientists, and (3) reproducibility of research.

The Data Science Toolbox is a virtual environment for data science that aims to solve these issues. Its main purpose is to get data scientists started in a matter of minutes. With just one command, a personal Data Science Toolbox is started either locally (using Vagrant) or in the cloud (using Amazon Web Services). In order to serve the two largest communities within data science, the base install contains both the Python scientific stack and R with many popular packages.

For teachers, authors, and organizations, making sure that their students, readers, or members have the same software installed is not straightforward. The Data Science Toolbox supports so-called bundles, each of which is a collection of software or data specific to a certain book, course, or project. This way, when someone attends a tutorial or follows along with the examples in a book, no time has to be wasted on setting up the correct environment. Furthermore, researchers can create a bundle that makes their data, code, and experiments easy to distribute and reproduce.

The project is still young, but already has three bundles available.

This open source project would not have been possible without a collection of wonderful platforms and software: Ubuntu, Amazon Web Services, GitHub, Packer, Ansible, and Vagrant. In the presentation we will explain, at a very high level, how these technologies work and why they are wonderful.

Jeroen Janssens

Data Science Workshops

Jeroen is a Senior Data Scientist at YPlan in New York City. He has an M.Sc. in Artificial Intelligence and a Ph.D. in Machine Learning. He is authoring a book titled “Data Science at the Command Line”, which will be published by O’Reilly in summer 2014. Jeroen enjoys biking the Brooklyn Bridge, building tools, and eating stroopwafels. He tweets at @jeroenhjanssens.