Expert Data Science with R

Garrett Grolemund (RStudio)
Data Science
Location: 122-123
Average rating: 4.21 (14 ratings)

This tutorial will teach you how to streamline your code and your thinking when doing data science. Analysts often spend over 80% of their time preparing and exploring data sets before they begin more formal analysis work. In this tutorial, I will introduce a set of principles — and R packages — that make this work easier and faster.

1. Data sets come in many formats, but software prefers just one.
R runs quickly and intuitively when your data is stored in long format, a layout that allows vectorized programming. Unfortunately this layout is less useful for storing and displaying data. The tidyr package makes it easy to reshape the layout of your data sets while retaining the relationships embedded in the data.
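As a rough sketch of this kind of reshaping, the example below moves a small wide table into a long layout. The `cases` data frame and its column names are made up for illustration. The sketch assumes a recent tidyr release that provides `pivot_longer()`; releases from the time of this tutorial used `gather()` for the same job.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
cases <- data.frame(
  country = c("A", "B"),
  `2011` = c(10, 20),
  `2012` = c(15, 25),
  check.names = FALSE
)

# Long layout: one row per country-year observation.
long <- pivot_longer(
  cases,
  cols = c(`2011`, `2012`),
  names_to = "year",
  values_to = "n"
)

# With every count in a single column, a vectorized summary is one call.
total <- sum(long$n)
```

Each count stays attached to its country and year, so the relationships in the original table survive the change of layout.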

2. Data sets contain more information than they display.
Every data set contains a wealth of descriptive statistics, group-level observations, and hidden variables that you must calculate before you can use them. The dplyr package provides five optimized functions that perform these transformations, as well as a pipe syntax that makes R code more concise and intuitive.
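The abstract does not name the five functions. The sketch below assumes they are `filter()`, `select()`, `arrange()`, `mutate()`, and `summarise()`, and combines them with `group_by()` and the `%>%` pipe on R's built-in `mtcars` data set. The derived column `wt_kg` is purely illustrative.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 100) %>%              # keep rows that meet a condition
  select(cyl, mpg, wt) %>%          # keep only the columns needed
  mutate(wt_kg = wt * 453.6) %>%    # derive a new variable (wt is in 1000 lb)
  group_by(cyl) %>%                 # define the groups
  summarise(
    mean_mpg = mean(mpg),           # group-level statistic
    n = n()                         # number of cars in each group
  ) %>%
  arrange(desc(mean_mpg))           # order groups by the new statistic
```

Read top to bottom, the pipe passes the result of each step to the next one, so the code follows the order in which the analysis is described.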

3. The structure of data sets parallels the structure of data visualizations.
Each observation in a data set can be displayed as a mark on a graph. Each variable can be displayed as a visual property of the mark. The result is a grammar of graphics that lets you create thousands of graphs by choosing several parameters: a data set, a type of mark, and a set of mappings between variables and properties. The ggvis package implements this grammar to make both static and interactive plots.
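A minimal sketch of these three choices in ggvis, again using the built-in `mtcars` data set (the particular variables chosen are only an example):

```r
library(ggvis)

# Data set: mtcars. Mark: points.
# Mappings: wt -> x position, mpg -> y position, cyl -> fill color.
p <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg, fill = ~factor(cyl)) %>%
  layer_points()

# Same data set and position mappings, different mark: lines instead of points.
p2 <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_lines()
```

Printing `p` or `p2` in an interactive session renders the plot. Changing any one of the three choices yields a different graph without rewriting the rest.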

The tutorial will be led by Garrett Grolemund, Data Scientist and Master Instructor at RStudio. Garrett maintains the lubridate R package, and is the author of Hands-On Programming with R as well as Data Science with R, a forthcoming book by O’Reilly Media.

Garrett Grolemund

RStudio

I specialize in teaching people how to use R – and especially Hadley Wickham’s R packages – to do insightful, reliable data science. Hadley was my dissertation advisor at Rice University, where I gained a first-hand understanding of his R libraries. While at Rice, I taught (and helped develop) the courses “Statistics 405: Introduction to Data Analysis” and “Visualization in R with ggplot2”. Before that, I taught introductory statistics as a Teaching Fellow at Harvard University.

I’m very passionate about helping people analyze data better. I have travelled as far as New Zealand, where R was born, to learn new ways to teach data science. I worked alongside some of the original developers of R to hone my programming skills, and I collaborated with the New Zealand government in a nationwide project to improve how New Zealand teaches data analysis to new statisticians.

Back in the States, I focused my doctoral research on developing pragmatic principles that guide data science. These principles create a foundation for learning R, which is a bit of a layer cake. R is a set of tools for implementing statistical methods, and statistical methods are themselves a set of tools for learning from data. Like all toolkits, R gives its best results to those who use it wisely.

Outside of teaching, I have spent time doing clinical trials research, legal research, and financial analysis. I also develop R software. I co-authored the `lubridate` R package, which provides methods to parse, manipulate, and do arithmetic with date-times, and I wrote the `ggsubplot` package, which extends `ggplot2`. I’m also the Editor-in-chief of RStudio’s Shiny Development Center (shiny.rstudio.com), the official resource for learning to use the shiny package to make interactive web apps with R.

Comments

Garrett Grolemund
31-10-2014 20:02 CET

Daniel, this course would be great for a beginner or intermediate R user; it is designed to help beginners become experts quickly. I’ll assume that you know how to make and subset R objects and how to use R functions (i.e., the basics of R). We’ll focus on how to work with and visualize data sets, so you do not need to know any statistics.

Daniel Teixeira
31-10-2014 11:32 CET

Hello, is this course only for experts in data science and R, or can beginners and intermediate users of R and data science also attend without getting lost?

Garrett Grolemund
4-10-2014 0:14 CEST

Ermelinda,

We’re not going to cover data types, factors, or missing values (NA) in R, but I would be happy to answer questions about these topics after the tutorial.

If you’d like to learn more about these things, I cover them in depth in Hands-On Programming with R, a book by O’Reilly.

ermelinda della valle
20-09-2014 13:33 CEST

Hi,
will we see how to encode factor variables with many levels and missing values …