One of the frustrations of data science comes when a problem grows from being manageable on a laptop or a single server to being too big to fit in memory or too slow to process. Crossing that line often means switching to a completely different environment and even a different language.
Apache Spark is the leader for distributed in-memory data analysis. It comes with advanced machine-learning modules and has interfaces for Scala, Python, and R. The SparkR project brings many of Spark’s capabilities to R but is still missing many of the machine-learning tools available from Python or Scala.
This year RStudio released the sparklyr package to provide tighter integration between the RStudio IDE and Spark. Sparklyr provides a backend to the commonly used dplyr package, so R users who are familiar with dplyr can keep using that interface, and it offers much more than SparkR in terms of machine learning and feature transformations.
Douglas Ashton, Aimee Gott, and Mark Sellors offer an overview of Apache Spark and the types of problems it can solve before walking you through hands-on examples covering the basics of working with distributed data, data manipulation, and machine learning. You’ll leave with everything you need to seamlessly scale your R data analysis to a distributed environment, without learning an entirely new language.
Doug Ashton is a senior data scientist at Mango Solutions, where he provides training and consultancy to a range of industries, from government to telecommunications and web retailers. Doug is a proponent of reproducible research and has spoken on such topics as reproducible environments and data analysis in teams.
As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.
Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.
©2017, O’Reilly UK Ltd