Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

sparklyr, implyr, and more: dplyr interfaces to large-scale data

Ian Cook (Cloudera)
2:40pm–3:20pm Thursday, March 8, 2018
Data science and machine learning
Location: LL20 A
Level: Intermediate
Average rating: 4.75 (4 ratings)

Who is this presentation for?

  • Data analysts and data scientists using R

Prerequisite knowledge

  • Experience using R, preferably including dplyr and other tidyverse packages

What you'll learn

  • How to use the popular R package dplyr with different large-scale data processing engines, including Apache Spark and Apache Impala (incubating)

Description

dplyr, one of the most popular packages for R, provides a consistent grammar for data manipulation that can abstract over diverse data sources. dplyr can work with in-memory data frames and can also efficiently query large-scale data with processing engines including Apache Spark and Apache Impala (incubating). But dplyr works differently with these different data sources—and the differences can be sneaky.
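
As a rough illustration of that consistent grammar (not taken from the talk), the sketch below runs the same dplyr pipeline against an in-memory data frame and against a database-backed table. It assumes the DBI, RSQLite, and dbplyr packages are installed, and it uses an in-memory SQLite database purely as a stand-in for a remote engine such as Spark or Impala; the table name "mtcars" and the variable names are illustrative choices.

```r
library(dplyr)
library(DBI)

# Assumption: in-memory SQLite stands in for a large-scale engine.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# The pipeline applied to an ordinary in-memory data frame.
local_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# The same verbs applied to a remote table. Nothing is executed yet:
# dplyr only builds up a query description.
remote_query <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Print the SQL that dplyr generated on our behalf.
show_query(remote_query)

# Execute the query on the backend and pull the rows into R.
remote_result <- collect(remote_query)

dbDisconnect(con)
```

The remote pipeline stays lazy until collect() is called, which is one place where behavior can differ from working with a local data frame.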

Ian Cook demonstrates several dplyr-compatible interfaces, including sparklyr (from RStudio) and the new package implyr (from Cloudera), and offers tips for writing dplyr code that works across these different interfaces. He helps solve mysteries including:

  • Do I need to know SQL to use dplyr?
  • When is a “tbl” not a “tibble”?
  • Why is 1 not always equal to 1?
  • When should you collect(), collapse(), and compute()?
  • How can you use dplyr to combine data stored in different systems?
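
A couple of these questions can be sketched in code. The example below is not from the talk; it assumes DBI, RSQLite, and dbplyr are installed and again uses in-memory SQLite as a stand-in for a remote engine. The table name "cars" and the variable names are hypothetical.

```r
library(dplyr)
library(DBI)

# Assumption: in-memory SQLite stands in for Spark or Impala.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
cars_db <- copy_to(con, mtcars, "cars")

# A lazy query: it is a "tbl" but not a tibble (class "tbl_df").
efficient <- cars_db %>% filter(mpg > 25)
lazy_is_tbl    <- inherits(efficient, "tbl")
lazy_is_tibble <- inherits(efficient, "tbl_df")

# collapse(): stays lazy; treats the query so far as a subquery
# for any later verbs.
as_subquery <- collapse(efficient)
collapsed_is_lazy <- inherits(as_subquery, "tbl_lazy")

# compute(): runs the query and stores the result in a table on the
# backend; what comes back is still a remote (lazy) table.
cached <- compute(efficient)
cached_is_lazy <- inherits(cached, "tbl_lazy")

# collect(): runs the query and returns the rows to R as a tibble.
in_r <- collect(efficient)
collected_is_tibble <- inherits(in_r, "tbl_df")

dbDisconnect(con)
```

In short, collect() moves data into R, compute() keeps it on the backend, and collapse() only changes how the generated query is structured.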

Ian Cook

Cloudera

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.