dplyr, one of the most popular packages for R, provides a consistent grammar for data manipulation that can abstract over diverse data sources. dplyr can work with in-memory data frames and can also efficiently query large-scale data with processing engines including Apache Spark and Apache Impala (incubating). But dplyr works differently with these different data sources—and the differences can be sneaky.
Ian Cook demonstrates several dplyr-compatible interfaces, including sparklyr (from RStudio) and the new package implyr (from Cloudera), and offers tips for writing dplyr code that works across these different interfaces. He helps solve mysteries including:
Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at AMD. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.
©2018, O'Reilly Media, Inc.