Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

R and reproducible reporting for big data

Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions), Richard Pugh (Mango Solutions)
13:30–17:00 Wednesday, 1/06/2016
Data science & advanced analytics
Location: Capital Suite 13
Level: Intermediate
Average rating: 4.67 (3 ratings)

Prerequisite knowledge

Attendees should have a basic knowledge of R and should have used it previously for everyday tasks such as reading and writing data, producing graphics, and simple analysis.

Materials or downloads needed in advance

Attendees will need an Internet-connected laptop with a modern web browser.


R has become one of the main tools in the data science pipeline for analysis and reporting, but many users don’t know what to do when their data gets big. If the data can’t be stored in memory, can we still use R?

Aimee Gott, Mark Sellors, and Richard Pugh explore techniques for optimizing your workflow in R when working with big data, including how to efficiently extract data from a database, techniques for visualization and analysis, and how all of this can be incorporated into a single, reproducible report, directly from R.

The session starts with an introduction to dplyr’s data manipulation functions for in-memory data, then shows how to use the same package to work directly against a database. Aimee, Mark, and Richard look at how to connect to a database and get data into R without writing SQL, with a focus on best practices. Once attendees are comfortable extracting subsets of the data from the database, the presenters demonstrate how to use summaries of the data to generate graphics and perform simple analysis and prototyping, before bringing it all together in a single reproducible report written in RMarkdown.
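As a rough illustration of the database workflow described above, here is a minimal sketch. It is not taken from the session materials. It assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed, uses an in-memory SQLite database as a stand-in for a real database server, and loads R's built-in mtcars data into a hypothetical table named "cars".

```r
# Assumes DBI, RSQLite, dplyr, and dbplyr are installed.
library(DBI)
library(dplyr)

# An in-memory SQLite database stands in for a real database server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a built-in data set into a hypothetical table called "cars".
dbWriteTable(con, "cars", mtcars)

# tbl() creates a lazy reference to the table; no rows are pulled into R yet.
cars_db <- tbl(con, "cars")

# dplyr verbs are translated to SQL and evaluated inside the database.
mpg_by_cyl <- cars_db %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n_cars = n()) %>%
  arrange(cyl)

# show_query() prints the generated SQL without running it.
show_query(mpg_by_cyl)

# collect() runs the query and brings only the small summary into R.
result <- collect(mpg_by_cyl)
print(result)

dbDisconnect(con)
```

Only the summarised rows cross from the database into R, which is what makes the approach workable when the full table does not fit in memory. The same code could sit inside a chunk of an RMarkdown document so that the extraction, summary, and any plots are regenerated each time the report is rendered.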

Aimee Gott

Mango Solutions

As training lead at Mango, Aimee Gott has delivered over 200 days of training, including onsite training courses in Europe and the US in all aspects of R as well as shorter workshops and online webinars. Aimee oversees Mango’s training course development across the data science pipeline and regularly attends R user groups and meetups. Aimee is also a coauthor of Sams Teach Yourself R in 24 Hours. Aimee holds a PhD in statistics from Lancaster University.

Mark Sellors

Mango Solutions

Mark Sellors is head of data engineering for Mango Solutions, where he helps clients run their data science operations in production-class environments. Mark has extensive experience in analytic computing and helping organizations in sectors from government to pharma to telecoms get the most from their data engineering environments.

Richard Pugh

Mango Solutions

Richard Pugh is cofounder and chief data scientist at Mango.