R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large or distributed data with R is challenging. Hence R is used along with other frameworks and languages by most data scientist. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show an alternative, and complimentary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark distributed data remains inside the JVM instead of individual R processes running on workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark’s own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and JVM: collecting Spark DataFrames as R data frames and user space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark’s interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together Spark and R can provide state of the art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real world use cases in a live demo at the end of the talk.
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that he was a data scientist at Apple’s personal assistant, Siri. He graduated with Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).
Comments on this page are now closed.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.