With R and sparklyr, a Spark standalone cluster can be used to analyze large datasets stored in S3 buckets. Unlike Spark running on a YARN-managed cluster, a standalone cluster separates the computation from the data. This has several benefits: it potentially saves money because the cluster doesn’t persist the data; the cluster can be turned off or even terminated at will; it’s a safer and faster alternative if the standalone cluster is built using EC2 instances; and all the data is moved and analyzed inside AWS.
Drawing on his article “Using Spark standalone mode and S3,” Edgar Ruiz walks you through setting up a Spark standalone cluster on EC2 and offers an overview of S3 bucket folder and file setup, connecting R to Spark, the settings needed to read S3 data into Spark, and an approach to importing and wrangling data.
Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing analytics insights and server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.
©2017, O'Reilly Media, Inc.