Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

R Day (Full Day)

Garrett Grolemund (RStudio), Nina Zumel (Win-Vector LLC), John Mount (Win-Vector LLC), Stephen Elston (Quantia Analytics, LLC)
9:00am–5:00pm Tuesday, 03/29/2016
Average rating: 3.88 (8 ratings)

Prerequisite knowledge

This tutorial is aimed at those with an interest in data science who have some previous knowledge of R.


Tutorial prerequisites

Please bring a laptop and power cord—each class will be centered around hands-on exercises.

Before class, please install both R and the RStudio IDE and ensure that your computer can connect to the Internet. You will also need to download several R packages. We will email you the final list of packages to download the week before the class.

You can find instructions on how to install R, the RStudio IDE, and the R packages at R for Data Science.


9:00 AM – 10:30 AM
R quickstart: Transform and visualize data
Garrett Grolemund
Garrett Grolemund explores the most used—and most powerful—parts of the R language. You will learn the best ways to perform the core tasks of data science, including:

  • How to transform your data (with the dplyr package).
  • How to visualize your data (with the ggplot2 package).

These fast and intuitive packages will provide a solid foundation for everything else you do in R.

10:30 AM – 11:00 AM
Break

11:00 AM – 12:30 PM
Validating models in R
Nina Zumel and John Mount
Nina Zumel and John Mount demonstrate a number of techniques, R packages, and R code for validating predictive models, using example code, data, and live demonstrations and exercises. Learn how to determine if there is usable signal in your data, select variables, and choose models using R and R graphics (ggplot2). Increase your statistical efficiency and squeeze more signal out of your data.
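
As a rough illustration of the kind of check this session covers, the sketch below holds out part of a dataset and compares in-sample and out-of-sample error. It is not taken from the presenters' materials: the built-in mtcars data, the 70/30 split, the model formula, and the rmse helper are all assumptions made for this example.

```r
set.seed(2016)  # arbitrary seed (assumption) so the split is repeatable

# Hold out roughly 30% of the built-in mtcars rows as a test set.
n        <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Fit on the training rows only.
model <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error, in the units of the outcome.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$mpg, predict(model, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(model, newdata = test))
print(c(train = train_rmse, test = test_rmse))
```

A test error much larger than the training error would suggest the model is fitting noise rather than usable signal.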

12:30 PM – 1:30 PM
Break

1:30 PM – 3:00 PM
Scaling R: Analytics for big data
Stephen Elston
Stephen Elston teaches techniques for deep exploration and modeling of large, complex datasets with R, including:

  • Using the divide and recombine approach to partition data into meaningful subsets, perform embarrassingly parallel computations on the subsets, and combine the results in a statistically valid manner.
  • Creating useful key-value pairs to partition data and apply the MapReduce algorithm on parallel backends such as Hadoop or Spark.
  • Visualizing the most important components of large, complex datasets by ordering the multiple views.
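
The divide and recombine idea in the first bullet can be sketched in base R. This is a minimal illustration, not the course code: the built-in mtcars data, the cyl grouping key, the linear model, and the row-weighted average used to recombine are all assumptions chosen for brevity.

```r
# Divide: partition the built-in mtcars data by number of cylinders (the key).
subsets <- split(mtcars, mtcars$cyl)

# Apply: fit the same linear model independently on each subset.
# Each fit touches only its own subset, so lapply() could be swapped for
# parallel::mclapply() or a cluster backend without changing the logic.
fits <- lapply(subsets, function(d) coef(lm(mpg ~ wt, data = d)))

# Recombine: stack the per-subset coefficients (one row per subset) and
# average them, weighting each subset by its number of rows.
coefs    <- do.call(rbind, fits)
weights  <- vapply(subsets, nrow, integer(1))
combined <- colSums(coefs * weights) / sum(weights)
print(combined)
```

Averaging coefficients is only one possible recombination rule; whether it is statistically appropriate depends on the model and on how the data were partitioned.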

3:00 PM – 3:30 PM
Break

3:30 PM – 5:00 PM
Reproducible reports with big data
Garrett Grolemund
Garrett Grolemund demonstrates a time-saving workflow that has become the new standard for reproducible research. The R Markdown package makes it easy to document both your code and your results in the same file. With an R Markdown file and the click of a button, you can re-execute your analysis with the most up-to-date code and data to create new results, and/or generate a polished report in a variety of formats (HTML, PDF, DOC, etc.) to share your results. Garrett offers some best practices that further increase the efficiency of reproducible research with R Markdown.

Garrett Grolemund


Garrett Grolemund is a data scientist and chief instructor for RStudio, Inc. Garrett is a longtime user and advocate of R; he wrote the popular lubridate package for working with dates and times in R. Garrett designed and delivered the highly rated O’Reilly video series Introduction to Data Science with R and is the author of Hands-On Programming with R and the coauthor, with Hadley Wickham, of R for Data Science. He holds a PhD in statistics and specializes in teaching others how to do data science with open source tools.

Nina Zumel

Win-Vector LLC

Nina Zumel is cofounder and principal at Win-Vector LLC, a data science consultancy based in San Francisco. She frequently writes and speaks on statistics and machine learning. She is also the coauthor of the popular book Practical Data Science with R (Manning 2014).

John Mount

Win-Vector LLC

John Mount is a principal consultant at Win-Vector LLC, a San Francisco data science consultancy. John has worked as a computational scientist in biotechnology and a stock-trading algorithm designer and has managed a research team for (now an eBay company). He is the coauthor of Practical Data Science with R (Manning Publications, 2014). John started his advanced education in mathematics at UC Berkeley and holds a PhD in computer science from Carnegie Mellon (specializing in the design and analysis of randomized algorithms). He currently blogs about technical issues at the Win-Vector blog, tweets at @WinVectorLLC, and is active in the Rotary. Please contact for projects and collaborations.

Stephen Elston

Quantia Analytics, LLC

Stephen Elston is an experienced big data geek, data scientist, and software business leader. Steve is principal consultant at Quantia Analytics, LLC, where he leads the building of new business lines, manages P&L, and takes software products from concept and financing through development, intellectual property protection, sales, customer shipment, and support. Steve is also an instructor for the University of Washington data science program. Steve has over two decades of experience in visualization, predictive analytics and machine learning, at scales from small to massive, using many platforms including Hadoop, Spark, R, S/SPLUS, and Python. He has created solutions in fraud detection, capital markets, wireless systems, law enforcement, and streaming analytics for the IoT.

Comments on this page are now closed.


Garrett Grolemund
03/30/2016 3:18am PDT

Thank you to everyone who participated in R Day. All of the course materials are available on You can download them here:

03/28/2016 9:32am PDT

updated link for packages to pre-install:

03/26/2016 9:39am PDT

Hi John & co., Very excited for your workshop!

John Mount
03/19/2016 5:11am PDT

The “Validating models in R” section is going to be some lecture (with slides we will share) and about 5 mini-labs where we work through finished solutions to some exciting model scoring and validation problems in R. The intent is to give the participants ideas, references and working R code to take away.

If you bring a laptop with R and RStudio installed you should be able to run most of the examples with us (especially if you pre-install the packages listed here; the labs are not polished yet, so PLEASE don’t be too critical of the current state of the code we intend to share; also the intent of each lab will be explained in the lecture). We are also looking into the possibility of configuring RStudio server for participants.

We will be sharing all code and walking through pre-rendered results. So you should have a lively and great experience in this tutorial workshop however you decide to participate.

John Mount
01/17/2016 2:55am PST

I just wanted to say Nina Zumel and I are really looking forward to teaching this session. Validating models is a very exciting topic and R has some great facilities for both doing this and explaining this.