Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Sparklyr: An R interface for Apache Spark

Edgar Ruiz (RStudio)
2:40pm–3:20pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D
Level: Beginner
Secondary topics: R
Average rating: 4.80 (5 ratings)

Who is this presentation for?

  • Data scientists, data analysts, modelers, R users, Spark users, statisticians, and those in IT

What you'll learn

  • Discover how easy and practical it is to analyze big data with R and Spark
  • Learn what R is, what Spark is, how sparklyr works, and what is required to set up and tune a Spark cluster


Sparklyr, a free, open source package developed by RStudio in conjunction with IBM, Cloudera, and H2O, makes it easy and practical to analyze big data with R. The package provides an R interface to Spark’s distributed machine-learning algorithms and much more. With sparklyr, you can:

  • Interactively manipulate Spark data using both dplyr and SQL (via DBI)
  • Filter and aggregate Spark datasets, then bring them into R for analysis and visualization
  • Orchestrate distributed machine learning from R using either Spark ML or H2O Sparkling Water
  • Create extensions that call the full Spark API and provide interfaces to Spark packages
  • Establish Spark connections and browse Spark data frames within the RStudio IDE

Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.
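
As a rough illustration of the features listed above, here is a minimal sketch rather than material from the session itself. It assumes the sparklyr, dplyr, and DBI packages are installed and that a local Spark installation is available (for example via sparklyr::spark_install()). The table name "mtcars" and the choice of columns are illustrative assumptions.

```r
# Minimal sketch; assumes sparklyr, dplyr, and DBI are installed and a local
# Spark installation exists (e.g. installed with sparklyr::spark_install()).
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance; a real cluster would use another master.
sc <- spark_connect(master = "local")

# Copy a small built-in R data frame into Spark, purely for illustration.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL; collect() brings the result into R.
mpg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# The same connection also accepts plain SQL through DBI.
heavy_cars <- DBI::dbGetQuery(
  sc, "SELECT COUNT(*) AS n FROM mtcars WHERE wt > 3"
)

# Fit a Spark ML linear regression from R.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

# Reach the underlying Spark API directly: get the Java reference to the
# Spark DataFrame and invoke its count() method.
n_rows <- mtcars_tbl %>%
  spark_dataframe() %>%
  invoke("count")

spark_disconnect(sc)
```

The last step is the pattern that extensions build on: invoke() calls a method on a Spark object by name, so an R function can wrap any part of the Spark API that sparklyr does not expose directly.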

Edgar Ruiz


Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He has written multiple articles and blog posts that share analytics insights and cover server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.