Over 15 million learners have taken courses on the Coursera platform over the last three years. The interaction of our learners with these complex courses (containing various items like quizzes, peer-review assessments, videos, slides, readings, and programming assignments) has resulted in a massive proliferation of both semistructured and unstructured data. Representing these diverse data sources (with Cassandra, Kafka, and MySQL) offers interesting data-modeling and data-integration problems.
Roshan Sumbaly and Thomas Barthelemy cover lessons learned while standardizing Coursera’s ETL and eventing systems using Scalding and Kafka, respectively; building a data warehouse on Amazon Redshift with various modeling requirements; and tying all of the pieces together with Coursera’s open source pipeline manager, dataduct, built on top of Amazon Data Pipeline. These lessons act as a guidebook for new startups setting up data infrastructure.
Over time it became important for Coursera to use this rich data to generate insightful data products for its instructors and learners. Roshan and Thomas explain how their team settled on frameworks that allowed them to build both batch- and streaming-based data products and discuss the internally built developer ecosystem that has allowed Coursera to quickly iterate on its diverse data-products portfolio.
Roshan Sumbaly is an engineering manager at Facebook, where he leads computer vision efforts focused on visual people understanding and infrastructure. Previously, he led various teams at Coursera and LinkedIn, working on data products and infrastructure.
Pierre Thomas Barthelemy is the engineering lead of the Data Infrastructure team at Coursera. The team is responsible for introducing core data systems (e.g., a data warehouse using Redshift and ETL using Data Pipeline and Scalding), while also helping build products that create a developer-friendly ecosystem for easy iteration on data-driven products (e.g., data-modeling and visualization tools, the A/B-testing platform, and an online datastore for data products). Before Coursera, Thomas was a graduate student at Stanford.