
Teaching the Elephant to Read: Hadoop + Python + NLP

Sean Murphy (PingThings), Benjamin Bengfort (PingThings, Inc)
Hadoop in Action Sutton Center - Sutton South
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Average rating: 4.56 (18 ratings)
Slides:   external link

Many of the largest and most difficult-to-process data sets we encounter come not from well-structured logs or databases, but from unstructured bodies of text. In recent years, Natural Language Processing (NLP) techniques have accelerated our ability to stochastically mine data from unstructured text, but they themselves require large training data sets to produce meaningful results. Simultaneously, the growth of distributed computational architectures and file systems has allowed data scientists to deal with larger volumes of data; clearly there is common ground that can allow us to achieve spectacular results.

The two most popular open source tools for NLP and distributed computing, the Natural Language Toolkit (NLTK) and Apache Hadoop, are written in different languages: Python and Java, respectively. We will discuss a methodology for integrating them using Hadoop's Streaming interface, which sends data to and receives data from mapper and reducer scripts via the standard file descriptors.


  1. Creating mappers and reducers with Python (or any executable)
  2. Organizing LARGE bodies of text (corpora)
  3. Task: tokenization and segmentation; Motivation: cross-document language statistics
  4. Task: tagging and stemming; Motivation: information extraction
  5. Task: parsing for treebanks; Motivation: discovering crucial concepts
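To make the tokenization and segmentation task concrete, here is a drastically simplified, dependency-free stand-in for the two operations; NLTK's `sent_tokenize` and `word_tokenize` handle many more edge cases (abbreviations, contractions, quotes), so treat this only as an illustration of what the mapper would do to each document:

```python
import re

def segment(text):
    """Split text into sentences at ., ! or ? followed by whitespace.
    A naive stand-in for nltk.tokenize.sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens.
    A naive stand-in for nltk.tokenize.word_tokenize."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```

Emitting one record per sentence (or per token) from the mapper is what makes cross-document statistics a natural fit for the sorted, keyed reduce phase.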

We will provide a virtual machine (VMware) containing a single-node Hadoop development environment with all required software preinstalled and preconfigured for this workshop.

If you have text that you would like to process, please bring your data along. If not, we will be using corpora from Project Gutenberg.
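Before a corpus can be processed, its documents need to be read from disk and pushed into HDFS. A minimal sketch, assuming a directory of plain-text files such as Project Gutenberg downloads (`corpus_documents` is a hypothetical helper, not part of the workshop materials):

```python
import os

def corpus_documents(corpus_dir):
    """Yield (filename, text) pairs for every .txt file in a
    corpus directory, in sorted order for reproducibility."""
    for name in sorted(os.listdir(corpus_dir)):
        if name.endswith(".txt"):
            path = os.path.join(corpus_dir, name)
            with open(path, encoding="utf-8") as f:
                yield name, f.read()
```

From there, `hadoop fs -put` (or a small loop over these pairs) moves the raw text into HDFS where the streaming jobs can reach it.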


Sean Murphy


Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A-funded health care analytics firm and as the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000-member organization of data professionals.


Benjamin Bengfort

PingThings, Inc

Benjamin Bengfort is a full stack data scientist with a passion for massive machine learning involving gigantic training data sets. A founding partner and CTO at Unbound Concepts, he led the development of Meridian, the company’s textual complexity ranking algorithm designed to parse and determine the reading level of educational content for K-6 readers. With a professional background in military and intelligence and an academic background in economics and computer science, he brings a unique set of skills and insights to his work, and is currently pursuing a PhD in computer science at UMBC.

Comments on this page are now closed.


Benjamin Bengfort
10/28/2013 10:51am EDT

Code and slides are available here

Michael Grabenstein
10/28/2013 9:50am EDT

What is the URL for the slides?

Sean Murphy
10/27/2013 3:27pm EDT

We don’t have a pre-baked VM ready to download due to hosting and bandwidth constraints. We figured the wifi at the event would be a bit saturated. However, the instructions that were shared walk you through configuring your own VM using VirtualBox. The steps take a little time to go through but are pretty helpful in getting a fuller understanding of exactly what is happening. We will have about a dozen USB drives to pass around tomorrow with compressed images preconfigured for use with VirtualBox.

Israel Zuñiga de la Mora
10/27/2013 3:15pm EDT

Benjamin, Sean: Where can we get the VM in an OVA format, for those with VirtualBox or other VM management software?

Benjamin Bengfort
10/27/2013 6:55am EDT

Hi Erik-Jan,

Thanks for the suggestion. The only caveat I have with the quickstart CH4 virtual machine is that you have to figure out where the Hadoop bin and Hadoop lib is, and also ensure that you have a version of Hadoop that has the TypedBytes and AutoInputFormat classes.

I occasionally use CDH4, but I have often preferred my own setup, and after all the only step you’re skipping is the Hadoop install.

Looking forward to meeting you tomorrow!

Erik-Jan van Baaren
10/27/2013 5:38am EDT

You can also download the Cloudera VM and install the python bits on that one. Much easier IMHO.

Benjamin Bengfort
10/25/2013 3:45pm EDT

Get ready for the tutorial by installing the code and the data so that you can follow along! Installation instructions are on the Data Community DC website!

Sean Murphy
10/25/2013 2:24pm EDT

Yes, we will be doing most of the work in Python for this workshop.

Rajesh Mallipeddi
10/25/2013 2:17pm EDT

Is knowledge of Python mandatory for this lab?



Contact Us

View a complete list of Strata + Hadoop World 2013 contacts