dplyr, one of the most popular packages for R, provides a consistent grammar for data manipulation that can abstract over diverse data sources. dplyr can work with in-memory data frames and can also efficiently query large-scale data with processing engines including Apache Spark and Apache Impala (incubating). But dplyr works differently with these different data sources—and the differences can be sneaky.
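As a rough illustration of one such difference, the sketch below runs the same dplyr pipeline against a local data frame and against a remote table. It assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed, and it uses an in-memory SQLite database purely as a stand-in for an engine such as Spark or Impala. The table name `mtcars_remote` and the variable names are illustrative and are not taken from the session.

```r
library(dplyr)

# Local, in-memory data frame: each verb is evaluated immediately in R.
local_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Remote table. ASSUMPTION: SQLite stands in for Spark or Impala here.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
cars_remote <- copy_to(con, mtcars, name = "mtcars_remote")

# The same verbs now build a lazy query; nothing is computed in R yet.
remote_query <- cars_remote %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Inspect the SQL that dplyr generated for the backend.
show_query(remote_query)

# collect() runs the query and returns the result as a local tibble.
remote_result <- collect(remote_query)

DBI::dbDisconnect(con)
```

With the local data frame the result is available right away, whereas the remote pipeline stays a lazy query until `collect()` is called. Functions that have no SQL translation for a given backend may also fail or behave differently there, so checking `show_query()` is a reasonable habit.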
Ian Cook demonstrates several dplyr-compatible interfaces, including sparklyr (from RStudio) and the new package implyr (from Cloudera), and offers tips for writing dplyr code that works across these different interfaces. He helps solve mysteries including:
Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.
©2018, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.