
How to Create Predictive Models in R Using Ensembles

Giovanni Seni (Intuit)
Data Science Murray Hill Suite
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.

Tutorial Attendee Prerequisites

The following steps MUST be taken if you plan to attend this tutorial:

(1) INSTALL OPEN SOURCE R: Go to, click on the CRAN link under the “Download, Packages” heading on the left-hand side, scroll down to the USA heading, select the first link, and install the most current version of R that is appropriate for your machine (Linux, Mac, or Windows).

(2) INSTALL RSTUDIO: Go to and click on “Download RStudio Desktop.” The website will automatically detect the installer appropriate for your system. RStudio provides an integrated development environment that is more user-friendly than the base R installation. Skip this step if you already have an IDE for R that you use and like.

(3) INSTALL PACKAGES TO BE USED IN TUTORIAL: We will use the R packages listed in the “RequiredPackages.txt” file during the hands-on part of the course. Please get a head start by installing them. If you don’t know what it means to “install an R package” then please see

(4) COURSE MATERIALS: A copy of the datasets, handouts, and R code will be distributed on a thumb drive.

==== RequiredPackages.txt ====

1) install.packages("name") where name is each of:

2) Install RuleFit3
Go to and follow the installation instructions there according to your computer OS type.
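
As a rough sketch of the install.packages step above, the snippet below checks which packages from a list are not yet installed before installing anything. The names in `required` are placeholders chosen only for illustration (they are base packages, not the tutorial's actual list), and `missing_packages` is a hypothetical helper, not part of the course materials.

```r
# Placeholder names for illustration only; substitute the packages
# actually listed in RequiredPackages.txt.
required <- c("stats", "utils")

# Hypothetical helper: return the subset of `pkgs` that is not yet
# installed in any of the current library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing; this needs network access
# and a configured CRAN mirror.
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages that are already present, which can save time on a slow conference network.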

Tutorial Description

The discovery of ensemble methods is one of the most influential developments in data analysis and machine learning in the past decade. These methods combine multiple models into a single predictive system that is often more accurate than the best of its components. Ensemble methods can provide a critical boost to existing systems that address the hardest industrial challenges, from investment timing to drug discovery and from fraud detection to recommendation systems, where predictive accuracy is vital.

This tutorial, based on a book published by the speaker, offers a concise, hands-on introduction to the topic. Participants will use data sets and snippets of R code provided by the instructor to experiment with the “classic” ensemble methods (bagging, random forests, and boosting) as well as the more modern Rule Ensembles, which use a regularization-based post-processing step for improved accuracy and interpretability. Participants will learn the properties of these methods, what they have in common, and their individual strengths and weaknesses. The tutorial is aimed at novice and intermediate data practitioners with little exposure to ensemble methods.
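
To give a flavor of how an ensemble combines multiple models, here is a minimal bagging sketch in base R. It is not taken from the tutorial materials: the synthetic sine-curve data, the one-split "stump" learner, and all function names (`fit_stump`, `predict_stump`, `bagged_predict`) are assumptions made for illustration. Each stump is fit to a bootstrap resample of the training data, and the ensemble prediction is the average over all stumps.

```r
set.seed(1)

# Synthetic training data (an assumption for illustration): a noisy sine curve
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fit a one-split regression stump by picking the threshold that
# minimizes the total squared error of the two leaf means.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbors
  sse <- vapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}

# Predict with a single stump: one constant on each side of the cut
predict_stump <- function(stump, newx) {
  ifelse(newx <= stump$cut, stump$left, stump$right)
}

# Bagging: fit B stumps, each on a bootstrap resample (sampling with replacement)
B <- 100
stumps <- lapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Ensemble prediction: average the predictions of all stumps
bagged_predict <- function(stumps, newx) {
  preds <- vapply(stumps, predict_stump, numeric(length(newx)), newx = newx)
  preds <- matrix(preds, nrow = length(newx))  # also handles a single new point
  rowMeans(preds)
}

# Compare a single stump with the bagged ensemble on a grid of new points
grid <- seq(0, 2 * pi, length.out = 50)
single <- predict_stump(fit_stump(x, y), grid)
bagged <- bagged_predict(stumps, grid)
print(mean((single - sin(grid))^2))
print(mean((bagged - sin(grid))^2))
```

A single stump can only produce a two-level step function, while the average of many stumps with different cut points is smoother. Random forests and boosting build on the same idea of combining many simple models, but they differ in how the individual models are constructed and weighted.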

Giovanni Seni


Giovanni Seni is currently a Senior Data Scientist with Intuit, where he leads the Applied Data Sciences team. As an active data mining practitioner in Silicon Valley, he has over 15 years of R&D experience in statistical pattern recognition and data mining applications. He has been a member of the technical staff at large technology companies and a contributor at smaller organizations. He holds five US patents and has published over twenty conference and journal articles. His book with John Elder, “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions”, was published in February 2010 by Morgan & Claypool. Giovanni is also an adjunct faculty member in the Computer Engineering Department of Santa Clara University, where he teaches an Introduction to Pattern Recognition and Data Mining class.




Contact Us

View a complete list of Strata + Hadoop World 2013 contacts