Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Supercharging R with Spark for end-to-end data science

Hossein Falaki (Databricks Inc.)
1:15pm–1:55pm Wednesday, 09/30/2015
Spark & Beyond
Location: 1 E20 / 1 E21 Level: Intermediate
Average rating: 3.65 (26 ratings)

R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging. Hence, most data scientists use R along with other frameworks and languages. In this mode, most of the friction is at the interface between R and the other systems. For example, when data is sampled by a big data platform, the results need to be transferred to R and imported as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.

Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM instead of in individual R processes running on workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark’s own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user-space filesystems. We think this model complements and improves the day-to-day workflow of many data scientists who use R. Spark’s interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.

Hossein Falaki

Databricks Inc.

Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that, he was a data scientist working on Siri, Apple’s personal assistant. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).

Comments

Hossein Falaki
10/01/2015 6:29am EDT

The slides are available here

Adam Preston
09/30/2015 1:58pm EDT

I too would like the slides and the code.

Joseph Benitez
09/30/2015 10:25am EDT

Another person already asked, but I would really like to get a copy of the slides that you presented in the session. Pretty interesting!

Yang Guo
09/30/2015 10:17am EDT

Hi Hossein, would it be possible to get a copy of the slides and example code?