Data scientists love to learn new technologies, train interesting statistical models, and create exciting data visualizations. Unfortunately, setting up a workable environment by installing all the required software and its dependencies is often not straightforward. These installation issues hamper (1) the learning experience of students, (2) the productivity of data scientists, and (3) the reproducibility of research.
The Data Science Toolbox is a virtual environment for data science that aims to solve these issues. Its main purpose is to get data scientists started in a matter of minutes. With just one command, a personal Data Science Toolbox is started either locally (using Vagrant) or in the cloud (using Amazon Web Services). In order to serve the two largest communities within data science, the base install contains both the Python scientific stack and R with many popular packages.
For teachers, authors, and organizations, making sure that their students, readers, or members have the same software installed is not straightforward. The Data Science Toolbox supports so-called bundles, each of which is a collection of software or data specific to a certain book, course, or project. This way, when someone attends a tutorial or follows along with the examples in a book, no time has to be wasted on setting up the correct environment. Furthermore, researchers can create a bundle to make their data, code, and experiments easy to distribute and reproduce.
The project is still young, but already has three bundles available:
This open source project would not have been possible without a collection of wonderful platforms and software: Ubuntu, Amazon Web Services, GitHub, Packer, Ansible, and Vagrant. In the presentation we will explain, at a very high level, how these technologies work and why they are wonderful.
Jeroen is a Senior Data Scientist at YPlan in New York City. He has an M.Sc. in Artificial Intelligence and a Ph.D. in Machine Learning. He is authoring a book titled “Data Science at the Command Line”, which will be published by O’Reilly in summer 2014. Jeroen enjoys biking the Brooklyn Bridge, building tools, and eating stroopwafels. He tweets at @jeroenhjanssens.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.