Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Poor man's parallel pipelines

Jeroen Janssens (Data Science Workshops B.V.)
14:35–14:55 Thursday, 7/05/2015
Data Science
Location: King's Suite - Balmoral
Average rating: 3.71 (7 ratings)

Prerequisite Knowledge

Some familiarity with the command line on Linux or Mac OS is useful, but not required. Microsoft Windows users will be able to follow along if they install the free Data Science Toolbox.

Description

Hadoop, Storm, and Spark are fantastic frameworks for processing massive amounts of data in parallel. Every now and then, however, there is a one-off data science task that could really use some speeding up. For those kinds of tasks, it’s probably not worthwhile to set up a large framework. This presentation demonstrates GNU Parallel, which allows you to easily parallelize and distribute such tasks.

GNU Parallel is a small command-line tool that requires no setup. It allows you to parallelize a task across multiple cores and even distribute it across multiple remote instances. This presentation zooms in on Chapter 8 of Data Science at the Command Line, written by Jeroen Janssens and recently published by O’Reilly.
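
As a rough illustration of the idea (not an excerpt from the book or the talk), the sketch below starts one job per input argument. It assumes GNU Parallel is already installed; the numbers and the `echo` command are placeholders for real inputs and a real task.

```shell
# Minimal sketch; assumes GNU Parallel is installed and on the PATH.
# Each argument after ::: becomes one job; {} is replaced by that argument.
# -k (--keep-order) prints results in input order, even if jobs finish
# out of order. By default, jobs are spread over the available CPU cores.
squares=$(parallel -k 'echo $(( {} * {} ))' ::: 1 2 3 4)
echo "$squares"
```

Swapping the `echo` for a slower command is where the benefit of running jobs side by side would show up.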

We’ll use Amazon Web Services to spin up remote instances during the presentation, although any cloud service, or just your own laptop, can be used. No special setup is required. Topics that we’ll cover during the presentation include:

  • Installing GNU Parallel
  • Processing many large files
  • Processing streaming data
  • Discovering remote instances
  • Keeping a log of all the jobs
  • Timing out, resuming, retrying, and monitoring jobs
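
Several of these topics map onto GNU Parallel options. The following is a small sketch rather than material from the talk: it assumes GNU Parallel is installed, and the file names and the line-counting task are invented for illustration.

```shell
# Sketch only; assumes GNU Parallel is installed. The file names are made up.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/part1.txt"
printf 'c\nd\ne\n' > "$workdir/part2.txt"

# Count the lines in each file, one job per file.
#   --joblog   records start time, runtime, and exit status of every job
#   --resume   on a rerun, skips jobs already recorded in the job log
#   --timeout  kills a job that runs longer than 60 seconds
#   --retries  runs a failing job up to 3 times
counts=$(parallel -k --joblog "$workdir/jobs.log" --resume \
  --timeout 60 --retries 3 \
  'wc -l < {}' ::: "$workdir/part1.txt" "$workdir/part2.txt")
echo "$counts"

# The log holds a header line followed by one line per job.
cat "$workdir/jobs.log"
```

Adding `--progress` or `--eta` to the same command is one way to monitor jobs while they run.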

Through real-world examples, we’ll demonstrate how easy and effective GNU Parallel can be. For example, we’ll use GNU Parallel to speed up downloading data from a web API over multiple connections, and to train many machine learning models in parallel with different parameters. By the end of this presentation, you’ll be able to set up and use GNU Parallel the next time you encounter a one-off data science task that could really use some speeding up.
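
To give a feel for what those two examples might look like, here is a hedged sketch. The API URL and the `train.py` script are hypothetical, and `--dry-run` makes GNU Parallel print the commands it would run instead of executing them, so nothing is downloaded or trained.

```shell
# Sketch only; assumes GNU Parallel is installed. The URL and train.py are
# hypothetical. --dry-run prints each command instead of running it.

# Fetch pages 1-3 of an API, with at most 2 jobs (connections) at a time.
downloads=$(parallel --dry-run -k -j2 \
  'curl -s "https://api.example.com/items?page={}" > page-{}.json' ::: 1 2 3)
echo "$downloads"

# Two input sources yield every combination: 3 depths x 2 rates = 6 jobs.
# {1} and {2} refer to the first and second input source.
grid=$(parallel --dry-run -k \
  'python train.py --depth {1} --rate {2} > model-{1}-{2}.log' \
  ::: 2 4 8 ::: 0.1 0.01)
echo "$grid"
```

Removing `--dry-run` would run the commands for real, which only makes sense once the placeholder URL and script are replaced with actual ones.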

Jeroen Janssens

Data Science Workshops B.V.

Jeroen is a lead data scientist at Elsevier in Amsterdam. He has an M.Sc. in artificial intelligence and a Ph.D. in machine learning. Jeroen has authored a book titled “Data Science at the Command Line”, published by O’Reilly. Jeroen is passionate about building open source tools for data science.