Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.
The official Jupyter Conference
August 22-23, 2017: Training
August 23-25, 2017: Tutorials & Conference
New York, NY

Lessons learned from tens of thousands of Kaggle notebooks

Megan Risdal (Kaggle), Wendy Chih-wen Kan (Kaggle)
5:00pm–5:40pm Thursday, August 24, 2017
Reproducible research and open science
Location: Murray Hill
Level: Beginner
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Anyone who uses open source data science languages (e.g., R and Python) to work with data

Prerequisite knowledge

  • A basic understanding of machine learning (useful but not required)

What you'll learn

  • Understand the benefits of collaborative data science
  • Learn how to share and work on data projects using Kaggle Kernels


Sharing insights and building on them in collaboration with others is integral to open data science. On Kaggle, the data science community uses Kernels as a platform to share reproducible code, data, and knowledge. Since the introduction of code sharing on Kaggle in 2015, users have written tens of thousands of kernels, of which 45% are R, Python, and Julia notebooks. Over this time, Kernels has transformed how Kagglers tackle competitive machine learning problems, collaborate, and learn.

Megan Risdal and Wendy Chih-wen Kan discuss what Kernels has taught Kaggle about collaborative data science. Megan and Wendy begin by highlighting how code sharing in competitions has allowed users to learn from others and incorporate their ideas and approaches, ultimately raising the competitive bar while fostering an online culture more inclusive of data scientists of all skill levels. They then describe how Kernels, combined with public datasets published on Kaggle, creates a repository of knowledge and reproducible analyses around high-value data. They conclude by demonstrating the ingredients of a “successful” notebook on Kaggle, based on community metrics.


Megan Risdal


Megan Risdal is a marketing manager at Kaggle. She holds master’s degrees in linguistics from the University of California, Los Angeles, and North Carolina State University. Her curiosities lie at the intersection of data, science, language, and learning.


Wendy Chih-wen Kan


Wendy Kan is a data scientist at Kaggle, the largest global data science community, where she works with companies and organizations to transform their data into machine learning competitions. Previously, Wendy was a software engineer and researcher. She holds BS and MS degrees in electrical engineering and a PhD in biomedical engineering.