Hands On Mahout - Mammoth Scale Machine Learning

Data: Analytics and Visualization
Location: Oregon Ballroom 203
Average rating: 2.75 (4 ratings)

Attendee prerequisites for this tutorial are listed below.

Mahout is an open source machine learning library from Apache. At the present stage of development, it is evolving with a focus on collaborative filtering/recommendation engines, clustering, and classification.

There is no user interface, pre-packaged distributable server, or installer. It is, at best, a framework of tools intended to be used and adapted by developers. The algorithms in this “suite” can be used in applications ranging from recommendation engines for movie websites to early warning systems in credit risk engines for the card industry.

This tutorial aims to help you set up Mahout to run on Hadoop. The instructor will walk you through the basic idea behind each of the algorithms. Having done that, we’ll look at how they can be run on large datasets and used to solve real-world problems.

If your site, smartphone app, or viral Facebook app collects data that you want to use far more productively, this session is for you!


Instructions for setting up Mahout

First, subscribe to the mahout-oscon Google Group for updates and announcements, and to discuss issues with setting up Mahout for the tutorial.

Platforms supported by Mahout

  • Linux
  • Mac
    (it’s possible to set up Mahout on Cygwin on Windows, but that is an unsupported platform for both Hadoop and Mahout)

System Requirements

  • Java 1.6.x or greater.
  • Maven 2.2.x to build the source code.
  • Subversion 1.6 or higher

On Mac

  • Install MacPorts: http://www.macports.org/
  • Install Maven and Subversion using MacPorts. The commands are given below:
    • sudo port install subversion
    • sudo port install maven
  • Install Java for Mac OS X from the Apple website or using the MacUpdate mechanism

On Linux

  • On Debian/Ubuntu systems, install Subversion, a JDK, and Maven using apt-get (apt-get install <package>)
  • On Fedora systems, install Subversion, a JDK, and Maven using yum (yum install <package>)
  • Ensure the version numbers match those given above
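One way to confirm the version numbers from System Requirements is a small shell check like the sketch below. The function names (version_ge, check_tool, check_prereqs) are hypothetical helpers written for this illustration, and treating Maven 2.2 as a minimum rather than an exact requirement is an assumption.

```shell
#!/bin/sh
# Sketch of a prerequisite check. All function names here are illustrative.

# version_ge HAVE NEED: succeeds when HAVE >= NEED, comparing only the
# first two dot-separated numeric fields (major.minor).
version_ge() {
  awk -v have="$1" -v need="$2" 'BEGIN {
    split(have, h, "."); split(need, n, ".")
    for (i = 1; i <= 2; i++) {
      if (h[i] + 0 > n[i] + 0) exit 0
      if (h[i] + 0 < n[i] + 0) exit 1
    }
    exit 0
  }'
}

# check_tool LABEL DETECTED_VERSION MINIMUM_VERSION
# Prints one status line and returns non-zero if the tool is missing or too old.
check_tool() {
  if [ -z "$2" ]; then
    echo "$1: not found"
    return 1
  fi
  if version_ge "$2" "$3"; then
    echo "$1 $2: ok (need >= $3)"
  else
    echo "$1 $2: too old (need >= $3)"
    return 1
  fi
}

# check_prereqs: detect the installed versions and compare them with the
# minimums listed under System Requirements.
check_prereqs() {
  status=0
  # java prints its version banner on stderr, e.g.: java version "1.6.0_26"
  check_tool java "$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)" 1.6 || status=1
  # The mvn banner contains a line such as: Apache Maven 2.2.1 (...)
  # Assumption: 2.2 is treated as a minimum, so newer Maven versions also pass.
  check_tool mvn "$(mvn -version 2>/dev/null | sed -n 's/^Apache Maven \([0-9.]*\).*/\1/p' | head -n 1)" 2.2 || status=1
  # svn --version --quiet prints only the version number
  check_tool svn "$(svn --version --quiet 2>/dev/null | head -n 1)" 1.6 || status=1
  return $status
}
```

Source the snippet and run check_prereqs. It prints one line per tool and returns a non-zero status if anything is missing or older than required.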

Setting up instructions
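
The checkout and build commands themselves do not appear in this text. As a rough sketch only, assuming the source is checked out from the Apache Subversion trunk and built with Maven, the sequence might look like the following. The repository URL, the target directory name, the -DskipTests flag, and the helper names run and build_mahout are all assumptions for illustration, not details taken from this page.

```shell
#!/bin/sh
# Sketch only: the URL, directory name, and build flags are assumptions.
MAHOUT_SVN_URL="${MAHOUT_SVN_URL:-http://svn.apache.org/repos/asf/mahout/trunk}"
MAHOUT_DIR="${MAHOUT_DIR:-mahout}"

# run CMD...: print the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then "$@"; fi
}

# build_mahout: check out the source, enter the directory, and build it.
build_mahout() {
  run svn checkout "$MAHOUT_SVN_URL" "$MAHOUT_DIR" || return 1
  run cd "$MAHOUT_DIR" || return 1
  # -DskipTests shortens the build by skipping the unit tests (assumption:
  # a full test run is not needed just to try the command-line tools).
  run mvn -DskipTests install || return 1
}
```

Run build_mahout to perform the steps, or DRY_RUN=1 build_mahout to print the commands without executing them.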

If everything went fine, you will have a compiled Mahout library on your laptop.
To test your setup, run the following command.

  • $ bin/mahout kmeans --help

If you have trouble compiling the library, send an email to the mahout-oscon Google Group. We will try to help you set up the library before you come to the tutorial.

QUESTIONS for the speaker?: Use the “Leave a Comment or Question” section at the bottom to address them.


Robin Anil


Robin is a committer at the Apache Software Foundation, where he works with the Mahout machine learning community. He is also a co-author of “Mahout in Action” from Manning Publications, a book on how Mahout is used to perform machine learning on terabytes of data with ease.

He used to be a tech lead on the ML infrastructure at Minekey Inc., a Valley-based startup focusing on recommendations and behavioral targeting for publisher content. He was introduced to the newly born Mahout community through the Google Summer of Code program while he was a dual-degree student at IIT Kharagpur. Since then, he has been adapting machine learning algorithms to the Map/Reduce format and has successfully merged his Complementary Naive Bayes and Frequent Pattern Mining implementations into the Mahout code base. He currently works as a software engineer at Google, Bangalore, and finds time outside work to contribute actively to the Mahout community.


Ted Dunning

MapR, now part of HPE

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member for the Apache Software Foundation, a PMC member, and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Comments on this page are now closed.


Robin Anil
07/29/2011 7:25pm PDT

Greg, you should use the vectors as input to kmeans and then use the clusterdumper tool to view them in text format. See the slides for reference.

Gregory Altman
07/27/2011 5:59pm PDT

Need help: Used Lucene to index documents. Ran mahout lucene.vector and produced out.vectors and out.dictionary. Now I’d like to produce clusters of documents from this in human-readable form. What mahout commands should I use, and in what sequence? Can you provide an example? Thanks!

Robin Anil
07/27/2011 11:11am PDT

Link to slides: goo.gl/XMIDl

Robin Anil
07/27/2011 9:00am PDT

Do download this dataset: goo.gl/qv6Ad. It won’t take more than 2 mins. Sending it early so that the pipes don’t get jammed during the session.

Rick Gordon
07/27/2011 8:06am PDT

If mvn is already installed, it’s pretty easy – even worked over the conference wifi. Looking forward to the session now.

Robin Anil
07/25/2011 3:11am PDT

That link didn’t come out right. Attempting again: groups.google.com/group/osc...

Robin Anil
07/25/2011 3:10am PDT

The Google Group link is groups.google.com/forum/?hl...!forum/oscon-mahout

Ted Dunning
07/25/2011 1:55am PDT


I doubt that you need all of XCode. Ports should be able to install maven and you should already have java.

Andrew Serff
07/25/2011 1:41am PDT

Next time can you please send the prereq email like a week in advance? I just got it (Monday morning) and now I have to download and install XCode over the OSCON wifi…hopefully it will finish before the tutorial…