Hadoop, Storm, and Spark are fantastic frameworks for processing massive amounts of data in parallel. Every now and then, however, there is a one-off data science task that could really use some speeding up. For those kinds of tasks, setting up a large framework is probably not worthwhile. This presentation demonstrates GNU Parallel, which allows you to easily parallelize and distribute such tasks.
GNU Parallel is a small command-line tool that requires no setup. It allows you to parallelize your task across multiple cores and even distribute it to multiple remote instances. This presentation zooms in on Chapter 8 of Data Science at the Command Line, written by Jeroen Janssens and recently published by O’Reilly.
We’ll make use of Amazon Web Services to spin up remote instances during the presentation, although any cloud service or your own laptop can be used. No special setup is required. Topics that we’ll cover during the presentation include:
Through real-world examples, we’ll demonstrate how easy and effective GNU Parallel can be. For example, we’ll use GNU Parallel to speed up downloading data from a web API using multiple connections, and to train many machine learning models in parallel with different parameters. By the end of this presentation, you’ll be able to set up and use GNU Parallel the next time you encounter a one-off data science task that could really use some speeding up.
Jeroen is a lead data scientist at Elsevier in Amsterdam. He has an M.Sc. in artificial intelligence and a Ph.D. in machine learning. Jeroen has authored a book titled “Data Science at the Command Line”, published by O’Reilly. Jeroen is passionate about building open source tools for data science.
©2015, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.