We data scientists love to create exciting data visualizations and insightful statistical models. Before we get to that point, however, a great deal of effort usually goes into obtaining, scrubbing, and exploring the required data.
The command line, although invented decades ago, is an amazing environment for performing such data science tasks. By combining small, yet powerful, command-line tools you can quickly explore your data and hack together prototypes. New tools such as GNU Parallel, jq, and Drake allow you to use the command line for today’s data challenges. Even if you’re already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make you a more efficient data scientist.
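As a rough illustration of what combining small tools can look like, the sketch below chains four standard utilities to find the most frequent values in a stream of lines. The function name top_values and the sample fruit data are hypothetical, chosen only for this example; they are not taken from the tutorial itself.

```shell
# top_values N: read lines on stdin and print the N most frequent ones,
# each prefixed by its count. (Hypothetical helper, for illustration only.)
top_values() {
  sort |                 # bring identical lines next to each other
    uniq -c |            # collapse duplicates, prefixing each with its count
    sort -rn |           # order by count, highest first
    head -n "${1:-2}"    # keep the first N lines (defaults to 2)
}

# Made-up sample data: one fruit name per line.
printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' | top_values 2
```

Each stage does one small job, so a stage can be swapped out or extended (for example, adding a filter before the first sort) without touching the rest of the pipeline.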
We will make use of the Data Science Toolbox, which is a free, open-source virtual environment that allows everybody to get started with data science in minutes. The Data Science Toolbox runs not only on Linux, but also on Mac OS X and Microsoft Windows. In this hands-on tutorial we will cover the following subjects:
Whether you’re entirely new to the command line or already dream in shell scripts, by the end of this tutorial you will have a solid understanding of how to leverage the power of the command line for your next data science project.
The outline of this tutorial roughly follows that of the book “Data Science at the Command Line”.
Jeroen Janssens is a senior data scientist at YPlan NYC, tonight’s going out app, where he’s responsible for making event recommendations more personal. Jeroen holds an M.Sc. in Artificial Intelligence from Maastricht University, the Netherlands and a Ph.D. in Machine Learning titled “Outlier Selection and One-Class Classification” from Tilburg University, the Netherlands. He is authoring a book called “Data Science at the Command Line”, which will be published by O’Reilly in summer 2014. Jeroen is @jeroenhjanssens on Twitter and blogs at jeroenjanssens.com.