Make Data Work
Oct 15–17, 2014 • New York, NY

Data Science at the Command Line

Jeroen Janssens (Data Science Workshops)
9:00am–12:30pm Wednesday, 10/15/2014
Data Science
Location: 1 E8/1 E9
Average rating: 3.96 out of 5 (27 ratings)

We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.

The command line, although invented decades ago, is an amazing environment for performing such data science tasks. By combining small, yet powerful, command-line tools you can quickly explore your data and hack together prototypes. New tools such as GNU Parallel, jq, and Drake allow you to use the command line for today’s data challenges. Even if you’re already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make you a more efficient data scientist.

We will make use of the Data Science Toolbox, which is a free, open-source virtual environment that allows everybody to get started with data science in minutes. The Data Science Toolbox runs not only on Linux, but also on Mac OS X and Microsoft Windows. In this hands-on tutorial we will cover the following subjects:

  • Essential concepts of the UNIX command line;
  • Setting up the Data Science Toolbox;
  • Integrating the command line with IPython and R;
  • Filters such as cut, grep, sed, and awk;
  • Scraping websites using curl, scrape, xml2json, and jq;
  • Managing your data science workflow using Drake;
  • Parallelizing and distributing data-intensive pipelines using GNU Parallel;
  • Turning existing Python, R and Java code into reusable command-line tools; and
  • Creating data visualizations and statistical models.

Whether you’re entirely new to the command line or already dream in shell scripts, by the end of this tutorial you will have a solid understanding of how to leverage the power of the command line for your next data science project.
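The idea of combining small, single-purpose tools can be sketched in Python for readers who know that language better than the shell. This is only an illustration, not material from the tutorial: the function names `grep`, `cut`, and `count_values` are hypothetical stand-ins that loosely mimic the command-line filters of the same name, and the sample log lines are invented.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def grep(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Keep only the lines that contain needle (loosely like `grep -F`)."""
    return (line for line in lines if needle in line)


def cut(lines: Iterable[str], field: int, delimiter: str = ",") -> Iterator[str]:
    """Extract one 1-based field from each line (loosely like `cut -d, -f`)."""
    return (line.split(delimiter)[field - 1] for line in lines)


def count_values(values: Iterable[str]) -> List[Tuple[str, int]]:
    """Count distinct values, most frequent first."""
    return Counter(values).most_common()


# Hypothetical sample data, invented for this illustration.
SAMPLE = [
    "2014-10-15,GET,/index.html",
    "2014-10-15,POST,/login",
    "2014-10-16,GET,/index.html",
    "2014-10-16,GET,/about.html",
]

if __name__ == "__main__":
    # Each step does one small job; the nesting plays the role of a pipe.
    for value, count in count_values(cut(grep(SAMPLE, "GET"), 3)):
        print(count, value)
```

Each function does one job and passes a stream of lines to the next, which is the same design that makes shell pipelines easy to assemble and rearrange.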

Tutorial Outline

The outline of this tutorial roughly follows that of the book “Data Science at the Command Line”:

  • Introduction (The OSEMN Model for Data Science; Why use the Command Line?)
  • Getting Started (Installing the Data Science Toolbox; Essential Concepts of the Command Line)
  • Step 1: Obtaining Data (From Logs; From APIs; From Websites; From Databases; From Microsoft Excel)
  • Creating Reusable Command-line Tools (From One-liners; From Existing Python and R Code)
  • Step 2: Scrubbing Data (Filtering Lines; Replacing Values; Extracting Columns; Merging Datasets)
  • Managing Your Data Science Workflow using Drake
  • Step 3: Exploring Data (Computing Statistics; Visualizing Data)
  • Speeding Up Data-Intensive Commands using GNU Parallel
  • Step 4: Modeling Data (Vowpal Wabbit; Mallet; Rio)
  • Conclusion
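As a rough idea of what "creating reusable command-line tools from existing Python code" in the outline above can look like, here is a minimal sketch. It is an assumption about the general approach, not the tutorial's own material: the function `scale`, the `factor` argument, and the file name `scale.py` are all hypothetical. The pattern is to keep the existing logic in a plain function and add a thin wrapper that reads standard input and writes standard output, so the script can sit in a pipeline.

```python
import argparse
import sys
from typing import Iterable, Iterator, List, Optional, TextIO


def scale(values: Iterable[float], factor: float) -> Iterator[float]:
    """Stand-in for existing code: multiply every value by factor."""
    return (value * factor for value in values)


def main(
    argv: Optional[List[str]] = None,
    stdin: Optional[TextIO] = None,
    stdout: Optional[TextIO] = None,
) -> int:
    """Thin command-line wrapper: one number per input line, one per output line."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout

    parser = argparse.ArgumentParser(
        description="Multiply each number on standard input by a factor."
    )
    parser.add_argument("factor", type=float, help="multiplier for every input line")
    args = parser.parse_args(argv)

    # Skip blank lines; float() tolerates the trailing newline.
    numbers = (float(line) for line in stdin if line.strip())
    for result in scale(numbers, args.factor):
        stdout.write(f"{result}\n")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Saved under the hypothetical name `scale.py`, the script could then be combined with other tools, for example `seq 5 | python scale.py 10`.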

Jeroen Janssens

Data Science Workshops

Jeroen Janssens is a senior data scientist at YPlan NYC, tonight’s going out app, where he’s responsible for making event recommendations more personal. Jeroen holds an M.Sc. in Artificial Intelligence from Maastricht University, the Netherlands, and a Ph.D. in Machine Learning from Tilburg University, the Netherlands, with a thesis titled “Outlier Selection and One-Class Classification”. He is authoring a book called “Data Science at the Command Line”, which will be published by O’Reilly in summer 2014. Jeroen is @jeroenhjanssens on Twitter and blogs at jeroenjanssens.com.

Comments on this page are now closed.

Comments

Jeroen Janssens
10/25/2014 1:45pm EDT

Hi Elizabeth,

Sorry for the late reply. I’ve updated the gist. Please let me know on Twitter if you have any further questions.

Elizabeth Barayuga
10/16/2014 8:37pm EDT

Hi Jeroen, I was wondering if you could share the commands for the bikes for NYC example. I seemed to have an issue displaying the image when I was using the Data Science Toolbox.

Jeroen Janssens
10/04/2014 3:10pm EDT

To get the most out of this tutorial, I recommend that you install the Data Science Toolbox ahead of time. This is an easy-to-install virtual machine that contains all the command-line tools and data discussed in the corresponding book, Data Science at the Command Line. It runs on Microsoft Windows, Mac OS X, and Linux, so everybody should be able to get their hands dirty during the tutorial. See http://datascienceatthecommandline.com for instructions. Thanks, and looking forward to seeing you there!