Using R and Hadoop for Statistical Computation at Scale

Antonio Piccolboni (Per data LLC), Joseph Rickert (Revolution Analytics)
Data Science Nassau Suite
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Average rating: 3.40 out of 5 (5 ratings)
  • Hadoop, the 5 minute introduction
  • R, the 5 minute introduction
  • RHadoop: what it is and what its goals are; where to find it, learn more about it, and even contribute to it.
  • The three RHadoop packages, each connecting R with one Hadoop technology: rhdfs for HDFS, rhbase for HBase, and rmr2 for MapReduce
  • The mapreduce model of computation: the basics, what’s special about it and why it works.
  • The rmr2 API: from.dfs, to.dfs and mapreduce
  • My first mapreduce job
  • IO formats: writing one for the airline delay dataset
  • Writing simple filters, selects, aggregations
  • A must-have: wordcount
  • Random sampling
  • Clustering with clara and mapreduce.
  • Contingency tables, and very large ones at that.
  • Linear least squares with lots of data and not so many variables.
  • Resampling and forests
  • Model building with RevoScaleR
  • Debugging: start local and small, end large and distributed.
  • Performance monitoring
  • A surprise final application as time permits
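
To give a flavor of the mapreduce model and the wordcount example named in the outline, here is a minimal in-memory sketch. It is written in plain Python, not R, and it does not use the rmr2 API or come from the tutorial materials; the names map_words, reduce_counts and run_mapreduce are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts that were emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in memory.

    A real Hadoop job distributes these phases across machines; this
    hypothetical helper only mimics the data flow on a single process.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Input records are (key, value) pairs: here, (line number, line of text).
lines = [(0, "to be or not to be"), (1, "to do is to be")]
word_counts = dict(run_mapreduce(lines, map_words, reduce_counts))
print(word_counts)
```

The mapper and reducer never share state; all communication happens through the grouped key-value pairs, which is the property that lets a real framework run them in parallel on separate machines.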
Antonio Piccolboni

Per data LLC

Antonio Piccolboni is a data scientist with both industrial and academic experience. His recent work includes the design and implementation of a big data analysis package in R, social network analysis for a top-20 global website, and web analytics for a major web ratings company. He is currently an independent consultant with clients including Dataspora and Revolution Analytics. He blogs about big data and analytics. His papers have received more than 4000 citations and his Erdős number is 3.

Joseph Rickert

Revolution Analytics

I am a marketing manager at Revolution Analytics with a passion for analyzing data. I have worked at a number of successful Silicon Valley start-ups, including Sytek, Alantec, Parallan Computer and Scotts-Valley Instruments. I have graduate degrees in both the Humanities and Statistics. I taught statistics briefly at SJSU, and I blog.


Antonio Piccolboni
10/24/2013 9:31am EDT

@prashant None. I would like to structure it more as a tutorial than a hands-on lab. I’d rather have the attendees’ full, undivided attention than have people with their noses in their laptops. Also, given the expected attendance, I don’t think I can effectively run a lab of that size.

10/24/2013 8:31am EDT

Hi Antonio, What software (or VM) is required to be installed to attend this tutorial? – Thank you, Prashant

Antonio Piccolboni
09/27/2013 7:33pm EDT

I would say moderately proficient. Knowing how to create a function, for instance, is necessary, but not how to create a package. Knowing about data frames is necessary, but not manipulating expressions with, say, substitute. There will be no introduction to R, but somebody strong in, for example, Python may be able to understand most of what’s happening.

Rajesh Mallipeddi
09/27/2013 3:27pm EDT

Hi, do we need to be proficient in R to take this tutorial?


Contact Us

View a complete list of Strata + Hadoop World 2013 contacts