Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Data applications and infrastructure at Coursera

Roshan Sumbaly (Facebook), Pierre Barthelemy (Coursera)
11:50am–12:30pm Thursday, 03/31/2016
Data Innovations

Location: LL21 E/F
Tags: education
Average rating: 3.90 (10 ratings)

Prerequisite knowledge

Attendees should have a basic understanding of data-infrastructure ecosystems.


Over 15 million learners have taken courses on the Coursera platform over the last three years. The interaction of our learners with these complex courses (containing various items like quizzes, peer-review assessments, videos, slides, readings, and programming assignments) has resulted in a massive proliferation of both semistructured and unstructured data. Representing these diverse data sources (with Cassandra, Kafka, and MySQL) offers interesting data-modeling and data-integration problems.

Roshan Sumbaly and Thomas Barthelemy cover various lessons learned while standardizing Coursera’s ETL and eventing system using Scalding and Kafka, respectively; building a data warehouse on Amazon Redshift with various modeling requirements; and tying all of the pieces together with Coursera’s open source pipeline manager, dataduct, built on top of Amazon Data Pipeline. These lessons act as a guidebook for new startups dealing with the setup of data infrastructure.

Over time, it became important for Coursera to use this rich data to generate insightful data products for its instructors and learners. Roshan and Thomas explain how their team settled on frameworks that allowed them to build both batch- and streaming-based data products and discuss the internally built developer ecosystem that has allowed Coursera to quickly iterate on its diverse data-products portfolio.

Roshan Sumbaly


Roshan Sumbaly is an engineering manager at Facebook, where he leads computer vision efforts focused on visual people understanding and infrastructure. Previously, he led various teams at Coursera and LinkedIn, working on data products and infrastructure.

Pierre Barthelemy


Pierre Thomas Barthelemy is the engineering lead of the Data Infrastructure team at Coursera. The team is responsible for introducing core data systems (e.g., a data warehouse using Redshift and ETL using Data Pipeline and Scalding), while also helping build products that create a developer-friendly ecosystem for easy iteration on data-driven products (e.g., data-modeling and visualization tools, the A/B-testing platform, and an online datastore for data products). Before Coursera, Thomas was a graduate student at Stanford.