Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

R for big data

Garrett Grolemund (RStudio), Nathan Stephens (RStudio, Inc.)
9:00am–12:30pm Tuesday, 09/27/2016
Data science & advanced analytics
Location: 3D 10
Level: Intermediate
Tags: r-lang
Average rating: 4.20 (5 ratings)

Prerequisite knowledge

  • Experience programming with R

Materials or downloads needed in advance

  • A laptop (You'll be provided an RStudio Server Pro account preloaded with all course materials.)

What you'll learn

  • Learn how to analyze big data with R and sparklyr

Description

    Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark. The package communicates with the Spark SQL and Spark ML APIs so that R users can easily manipulate and analyze data at scale. With sparklyr, you write your commands in the same dplyr syntax that you would use to manipulate data stored in memory or in an SQL database. The package translates your commands and runs them on Spark DataFrames. You also have access to all of the algorithms in Spark’s ML library, so you can run classification, regression, clustering, and many other algorithms on your data inside a Spark cluster. You can connect to existing Spark clusters or create local Spark instances, which lets you learn and develop with Spark locally and then run at scale.

    Topics include:

    • Connecting to Spark from R (the sparklyr package provides a complete dplyr backend)
    • Filtering and aggregating Spark datasets, then extracting them into R for analysis and visualization using popular tools like ggplot2 and R Markdown
    • Using Spark’s distributed machine-learning library from R
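
    The three topics above can be sketched in a few lines of R. This is an illustrative sketch rather than the tutorial's own material: it assumes recent versions of sparklyr and dplyr are installed and that a local Spark installation is available (for example, one set up with spark_install()). The built-in mtcars data set and the names mtcars_tbl, by_cyl, and fit are stand-ins chosen for the example.

```r
# Illustrative sketch only; not taken from the course materials.
# Assumes sparklyr and dplyr are installed and a local Spark is available.
library(sparklyr)
library(dplyr)

# 1. Connect to Spark from R. "local" starts a local Spark instance;
#    an existing cluster would use a different master URL.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark as a stand-in for real data.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# 2. Filter and aggregate inside Spark with ordinary dplyr verbs,
#    then collect() the small summarized result back into R.
by_cyl <- mtcars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(by_cyl)

# 3. Fit a model with Spark's machine-learning library from R.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
print(fit)

# Close the connection when finished.
spark_disconnect(sc)
```

    Because by_cyl is an ordinary R data frame after collect(), it can be passed directly to tools such as ggplot2 or rendered in an R Markdown report. Only the aggregated rows leave Spark, which is what keeps this pattern practical for large data sets.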

    Garrett Grolemund


    Garrett Grolemund is a data scientist and chief instructor for RStudio, Inc. Garrett is a longtime user and advocate of R; he wrote the popular lubridate package for working with dates and times in R. Garrett designed and delivered the highly rated O’Reilly video series Introduction to Data Science with R and is the author of Hands-On Programming with R and the coauthor, with Hadley Wickham, of R for Data Science. He holds a PhD in statistics and specializes in teaching others how to do data science with open source tools.

    Nathan Stephens

    RStudio, Inc.

    Nathan Stephens recently joined RStudio as director of solutions engineering. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He was an early adopter of R and has introduced it into many organizations. Nathan holds an MS in statistics from Brigham Young University.

    Comments on this page are now closed.


    Rajesh Haran
    10/03/2016 5:34am EDT

    Could you kindly post the slides, please?

    Kathy Yu
    08/19/2016 3:45pm EDT

    Hi Weiye,

    To attend this tutorial (or any tutorial on 9/27), you have to get the Silver pass ($2145), which gives you access to all tutorials, keynotes, and sessions 9/27-9/29.

    You can check out the list of standard discounts on the registration page.

    Weiye Deng
    08/19/2016 12:49pm EDT

    How much does it cost if I only want to take one course, ‘R for big data’?