Garrett Grolemund and Nathan Stephens explore sparklyr, a new package from RStudio that provides a familiar interface between the R language and Apache Spark. It communicates with the Spark SQL and Spark ML APIs so that R users can manipulate and analyze data at scale. With sparklyr, you write your commands in the same dplyr syntax that you would use to manipulate data stored in memory or in an SQL database; sparklyr translates those commands and runs them on Spark DataFrames. You also have access to the algorithms in Spark’s ML library, so you can run classification, regression, clustering, and many other algorithms on your data inside a Spark cluster. Because sparklyr can connect to existing Spark clusters or create local Spark instances, you can learn and develop with Spark locally and then run at scale.
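As a rough illustration of the workflow described above (not code from the session itself), a minimal sketch might look like the following. It assumes a local Spark installation is available (sparklyr's `spark_install()` can download one), uses R's built-in `mtcars` data purely as a stand-in, and relies on the formula interface to `ml_linear_regression()` found in recent sparklyr releases; the table name and the choice of model are illustrative assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; run spark_install() first if it is not.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark (stand-in for real data).
# cars_tbl is a reference to a Spark DataFrame, not an in-memory data frame.
cars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Ordinary dplyr verbs: sparklyr translates them and runs them in Spark.
mpg_by_cyl <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()  # bring the small summarised result back into R

print(mpg_by_cyl)

# Fit a model with Spark ML; the computation happens inside Spark.
fit <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

To run the same code against an existing cluster, only the `master` argument to `spark_connect()` would be expected to change; the appropriate value depends on how that cluster is set up.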
Garrett Grolemund is a data scientist and chief instructor for RStudio, Inc. Garrett is a longtime user and advocate of R; he wrote the popular lubridate package for working with dates and times in R. Garrett designed and delivered the highly rated O’Reilly video series Introduction to Data Science with R and is the author of Hands-On Programming with R and the coauthor, with Hadley Wickham, of R for Data Science. He holds a PhD in statistics and specializes in teaching others how to do data science with open source tools.
Nathan Stephens recently joined RStudio as director of solutions engineering. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He was an early adopter of R and has introduced it into many organizations. Nathan holds an MS in statistics from Brigham Young University.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.