Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Spark and R with sparklyr

Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
13:30–17:00 Tuesday, 23 May 2017
Big data and the Cloud, Spark & beyond
Location: Capital Suite 9
Level: Intermediate
Average rating: *****
(5.00, 1 rating)

Who is this presentation for?

  • Data scientists, analysts, and data engineers

Prerequisite knowledge

  • Intermediate knowledge of R and the dplyr package

Materials or downloads needed in advance

  • A laptop with a browser installed (You'll be provided with a cloud instance of RStudio with sparklyr set up, but you may wish to install Spark on your laptop, which requires current RStudio and Java installations and running the spark_install function from sparklyr.)

What you'll learn

  • How to use Spark from R with the sparklyr package


One of the frustrations in data science comes when a problem grows from being manageable on a laptop or a single server to being too big to fit in memory or too slow to process. Scaling up often means switching to a completely different environment and even a different language.

Apache Spark is the leader for distributed in-memory data analysis. It comes with advanced machine-learning modules and has interfaces for Scala, Python, and R. The SparkR project brings many of Spark’s capabilities to R but is still missing many of the machine-learning tools available from Python or Scala.

This year RStudio released the sparklyr package to provide tighter integration between the RStudio IDE and Spark. The package provides a backend for the commonly used dplyr package, so R users who are familiar with dplyr can keep using that interface, and it offers much more than SparkR in terms of machine learning and feature transformations.

Douglas Ashton, Aimee Gott, and Mark Sellors offer an overview of Apache Spark and the types of problems it can solve before walking you through hands-on examples covering the basics of working with distributed data, data manipulation, and machine learning. You’ll leave with everything you need to seamlessly scale your R data analysis to a distributed environment, without learning an entirely new language.

Douglas Ashton

Mango Solutions

Doug Ashton is a senior data scientist at Mango Solutions, where he provides training and consultancy to a range of industries, from government to telecommunications and web retailers. Doug is a proponent of reproducible research and has spoken on such topics as reproducible environments and data analysis in teams.

Aimee Gott

Mango Solutions

As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.

Mark Sellors

Mango Solutions

Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.