This tutorial will teach you how to streamline your code and your thinking when doing data science. Analysts often spend over 80% of their time preparing and exploring data sets before they begin more formal analysis work. In this tutorial, I will introduce a set of principles — and R packages — that make this work easier and faster.
1. Data sets come in many formats, but software prefers just one.
R runs quickly and intuitively when your data is stored in long format, a layout that allows vectorized programming. Unfortunately this layout is less useful for storing and displaying data. The tidyr package makes it easy to reshape the layout of your data sets while retaining the relationships embedded in the data.
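To make the idea of reshaping concrete, here is a minimal, language-neutral sketch in plain Python (not R, and not the tidyr API). The table, the column names, and the `to_long` helper are all hypothetical and exist only for illustration. It shows a wide layout, with one column per year, being rewritten as a long layout with one row per measurement.

```python
# Hypothetical wide table: one row per country, one column per year.
wide = [
    {"country": "A", "2013": 10, "2014": 12},
    {"country": "B", "2013": 7, "2014": 9},
]


def to_long(rows, id_col, key_name, value_name):
    """Turn every non-id column into its own (key, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the id column is repeated on each new row instead
            long_rows.append(
                {id_col: row[id_col], key_name: column, value_name: value}
            )
    return long_rows


# Illustrative helper, not a tidyr function.
long_table = to_long(wide, "country", "year", "cases")

# In the long layout, one expression covers every measurement at once.
total_cases = sum(row["cases"] for row in long_table)
```

Each original cell keeps its country and year, so the relationships in the data survive the change of layout.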
2. Data sets contain more information than they display.
Every data set contains a wealth of descriptive statistics, group-level observations, and hidden variables that you must calculate before you can use them. The dplyr package provides five optimized functions that perform these transformations, as well as a pipe syntax that makes R code more concise and intuitive.
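As a rough illustration of information that a data set holds but does not display, the sketch below computes a group-level statistic and a derived variable. It is plain Python, not dplyr; the rows, the column names, and the `group_means` helper are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations in long format.
rows = [
    {"group": "a", "value": 2.0},
    {"group": "a", "value": 4.0},
    {"group": "b", "value": 10.0},
]


def group_means(rows, by, column):
    """Collect the values of `column` per group, then average each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[column])
    return {key: mean(values) for key, values in groups.items()}


# A group-level summary that appears nowhere in the raw rows.
means = group_means(rows, "group", "value")

# A derived variable: how far each observation sits from its group mean.
with_deviation = [
    dict(row, deviation=row["value"] - means[row["group"]]) for row in rows
]
```

Neither the group means nor the deviations are stored in the original rows; both have to be calculated before they can be used.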
3. The structure of data sets parallels the structure of data visualizations.
Each observation in a data set can be displayed as a mark on a graph. Each variable can be displayed as a visual property of the mark. The result is a grammar of graphics that lets you create thousands of graphs by choosing several parameters: a data set, a type of mark, and a set of mappings between variables and properties. The ggvis package implements this grammar to make both static and interactive plots.
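The three parameters named above (a data set, a type of mark, and a set of mappings) can be sketched as a small data structure. This is plain Python and only a toy model of the idea, not the ggvis API; `build_marks`, the variable names, and the property names are hypothetical, and nothing is actually drawn.

```python
# Hypothetical data set: one dictionary per observation.
observations = [
    {"height": 150, "weight": 50, "sex": "F"},
    {"height": 180, "weight": 80, "sex": "M"},
]


def build_marks(data, mark, mappings):
    """Return one mark per observation.

    `mappings` pairs a visual property (key) with a variable name (value),
    so each mark carries the observation's values as visual properties.
    """
    return [
        {"mark": mark, **{prop: row[var] for prop, var in mappings.items()}}
        for row in data
    ]


# Same data, one choice of mark and mappings.
marks = build_marks(
    observations, "point", {"x": "height", "y": "weight", "fill": "sex"}
)
```

Changing only the mark type or the mappings describes a different graph of the same data, which is why a handful of parameters can cover so many plots.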
The tutorial will be led by Garrett Grolemund, Data Scientist and Master Instructor at RStudio. Garrett maintains the lubridate R package, and is the author of Hands-On Programming with R as well as Data Science with R, a forthcoming book by O’Reilly Media.
I specialize in teaching people how to use R – and especially Hadley Wickham’s R packages – to do insightful, reliable data science. Hadley was my dissertation advisor at Rice University, where I gained a first-hand understanding of his R libraries. While at Rice, I taught (and helped develop) the courses “Statistics 405: Introduction to Data Analysis” and “Visualization in R with ggplot2”. Before that, I taught introductory statistics as a Teaching Fellow at Harvard University.
I’m very passionate about helping people analyze data better. I have travelled as far as New Zealand, where R was born, to learn new ways to teach data science. I worked alongside some of the original developers of R to hone my programming skills, and I collaborated with the New Zealand government in a nationwide project to improve how New Zealand teaches data analysis to new statisticians.
Back in the States, I focused my doctoral research on developing pragmatic principles that guide data science. These principles create a foundation for learning R, which is a bit of a layer cake. R is a set of tools for implementing statistical methods, and statistical methods are themselves a set of tools for learning from data. Like all toolkits, R gives its best results to those who use it wisely.
Outside of teaching, I have spent time doing clinical trials research, legal research, and financial analysis. I also develop R software. I co-authored the `lubridate` R package, which provides methods to parse, manipulate, and do arithmetic with date-times, and I wrote the `ggsubplot` package, which extends `ggplot2`. I’m also the Editor-in-chief of RStudio’s Shiny Development Center (shiny.rstudio.com), the official resource for learning to use the shiny package to make interactive web apps with R.