R has become one of the main tools in the data science pipeline for analysis and reporting, but many users don’t know what to do when their data gets big. If the data can’t be stored in memory, can we still use R?
Aimee Gott, Mark Sellors, and Richard Pugh explore techniques for optimizing your workflow in R when working with big data, including how to efficiently extract data from a database, techniques for visualization and analysis, and how all of this can be incorporated into a single, reproducible report, directly from R.
Aimee, Mark, and Richard start with an introduction to dplyr's data manipulation functionality for in-memory data before explaining how to use the same package to work directly against a database. They look at how to connect to a database and pull data into R without writing SQL, with a focus on best practices. After covering how to extract subsets of the data from the database, they demonstrate how to use summaries of the data to generate graphics and perform simple analysis and prototyping, before bringing these techniques together in a single reproducible report with R Markdown.
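As a rough illustration of the database workflow described above (not the presenters' own material), the sketch below uses dplyr verbs against a database table and only pulls the summarised result into R. It assumes the DBI, dplyr, dbplyr, and RSQLite packages are installed; the in-memory SQLite database and the built-in mtcars data are stand-ins for a real database, and the object names (cars_db, summary_db, summary_local) are hypothetical.

```r
library(DBI)
library(dplyr)

# Assumption: an in-memory SQLite database stands in for a real data source.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# A lazy reference to the table; no rows are loaded into R at this point.
cars_db <- tbl(con, "mtcars")

# The dplyr verbs are translated to SQL (via dbplyr) and run in the database.
summary_db <- cars_db %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl)

# Inspect the SQL that would be sent, without writing it by hand.
show_query(summary_db)

# Only the small summary table is brought into memory.
summary_local <- collect(summary_db)
print(summary_local)

dbDisconnect(con)
```

Because the filtering and aggregation happen in the database, only a few summary rows cross into R, which is what makes this style workable when the full table does not fit in memory. The same summary_local object could then feed a plot or a chunk in an R Markdown report.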
As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.
Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.
Richard Pugh is cofounder and chief data scientist at Mango.
©2016, O’Reilly UK Ltd