Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Using R and Spark to analyze data on Amazon S3

Edgar Ruiz (RStudio)
1:15pm–1:55pm Thursday, September 28, 2017
Big data and the Cloud, Machine Learning & Data Science
Location: 1A 08/10 Level: Intermediate
Secondary topics: Cloud, R
Average rating: 4.00 (1 rating)

Who is this presentation for?

  • Data scientists, R developers, Spark users, big data architects, cloud architects, and those in IT

Prerequisite knowledge

  • A working knowledge of R, the EC2 and S3 services in AWS, and Spark

What you'll learn

  • Understand how to use Spark and sparklyr to analyze S3 data from R


With R and sparklyr, a Spark standalone cluster can be used to analyze large datasets stored in S3 buckets. Unlike running Spark on a YARN-managed cluster, a standalone cluster separates the computation from the data. This separation has practical benefits: because the cluster does not persist the data, it can be stopped or even terminated at will, which saves money; and when the standalone cluster is built on EC2 instances, all the data is moved and analyzed inside AWS, making it a faster and safer alternative.

Drawing on the information presented in his article “Using Spark standalone mode and S3,” Edgar Ruiz walks you through setting up a Spark standalone cluster using EC2 and offers an overview of S3 bucket folder and file setup, connecting R to Spark, the settings needed to read S3 data into Spark, and a data import and wrangle approach.
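The connect-read-wrangle workflow the session covers can be sketched in sparklyr roughly as follows. This is a minimal sketch, not the presenter's exact code: the master URL, bucket path, column list, and hadoop-aws package version are hypothetical placeholders, and S3 credentials are assumed to come from an IAM role on the EC2 instances or from the standard AWS environment variables.

```r
library(sparklyr)
library(dplyr)

# Pull in the hadoop-aws package so Spark can read the s3a:// protocol
# (version is a placeholder; match it to your cluster's Hadoop version)
conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"

# Connect to the standalone cluster's master (hypothetical host name)
sc <- spark_connect(
  master = "spark://master-node:7077",
  config = conf
)

# Read a folder of CSV files straight from S3 into Spark.
# Supplying the column types up front avoids a full schema-inference
# pass over the data, which matters for large S3 datasets.
flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "s3a://my-bucket/flights/",   # hypothetical bucket/folder
  infer_schema = FALSE,
  columns = list(
    year     = "integer",
    carrier  = "character",
    distance = "double"
  )
)

# Wrangle remotely with dplyr: the computation runs in Spark,
# and only the summarized result is collected into R
flights %>%
  group_by(carrier) %>%
  summarise(avg_distance = mean(distance, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```

The key design point is that `flights` is a remote table reference, not an R data frame: dplyr verbs are translated to Spark SQL and executed on the cluster, so only small aggregated results ever cross into R.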


Edgar Ruiz


Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He has written numerous articles and blog posts on analytics and on server infrastructure for data science. Most recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.