Just the Basics: Core Data Science Skills with Kaggle’s Top Competitors

Data Science Ballroom AB
Tutorial
Please note: To attend, your registration must include Tutorials on Tuesday.
Average rating: 3.67 (12 ratings)

*If you are signed up for this tutorial, please have the following installed or set up before you arrive onsite:

  • Up-to-date R installation (and an IDE, such as RStudio)
  • Python 2.7 with a package manager (pip or easy_install) and the following packages: numpy, scipy, scikit-learn, pandas, matplotlib, ipython
  • Git and a GitHub account
  • A text editor
  • Excel (not mandatory, but may be useful)
  • MATLAB/Octave (not mandatory, but may be useful)
  • Other modeling software or languages you prefer (for contest use)*

This tutorial targets people with basic programming experience and introduces them to the end-to-end analysis of predictive data problems. We will cover the topics in a largely language-agnostic way, drawing on examples from R and Python. The tutorial consists of four sections, the last of which is a hands-on Kaggle competition in which participants can experience firsthand the joys of creating a model and the sorrows of overfitting:

  • Identifying a problem (30 min)
      - Identifying opportunities to collect data
      - Reading data into a useful format
      - Understanding limitations in the data
  • Performing the analysis (45 min)
      - Feature extraction
      - Basic prediction methods
      - Cross validation
      - Numerical ways to assess performance
  • Visualizing the solution (30 min)
      - Showing the results
      - Telling a story through visualization
  • Hands-on, for-fun contest (75 min)
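
As a rough illustration of the cross-validation and overfitting topics in the outline above (a sketch, not the presenters' material), the snippet below scores a 1-nearest-neighbor classifier twice: once on the data it was trained on and once with 5-fold cross validation. The data is synthetic, the function names (make_data, predict_1nn, k_fold_indices, cross_val_accuracy) are made up for this example, and only the Python standard library is used so that it runs without the packages listed above.

```python
import random


def make_data(n, seed=0):
    """Synthetic 2-class data: label is 1 when x + y > 0, with 20% of labels flipped."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = int(x + y > 0)
        if rng.random() < 0.2:  # label noise that a model should not memorize
            label = 1 - label
        rows.append(((x, y), label))
    return rows


def predict_1nn(train, point):
    """Return the label of the training row closest to `point`."""
    nearest = min(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of `test` rows whose label is predicted correctly from `train`."""
    hits = sum(predict_1nn(train, features) == label for features, label in test)
    return float(hits) / len(test)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(rows, k=5):
    """Average accuracy over k folds, each scored by a model that never saw it."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [row for i, row in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k


data = make_data(200)
train_acc = accuracy(data, data)      # scored on the same rows it memorized
cv_acc = cross_val_accuracy(data, 5)  # scored on held-out folds only
print("training accuracy: %.3f" % train_acc)
print("5-fold CV accuracy: %.3f" % cv_acc)
```

Because a 1-nearest-neighbor model finds each training point as its own nearest neighbor, the training score says little about how the model handles new data; the held-out score is the more honest estimate.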

William Cukierski


William Cukierski is a data scientist at Kaggle. He has a bachelor’s degree in physics from Cornell University and a Ph.D. in biomedical engineering from Rutgers University, where he studied applications of machine learning in cancer research. Prior to joining Kaggle, he finished competitively in predictive data competitions on topics ranging from predicting stock movements, to forecasting grocery shopping, to automated essay grading.


Ben Hamner


Ben Hamner is responsible for data analysis, machine learning, and competitions at Kaggle. He has worked with machine learning problems in a variety of different domains, including natural language processing, computer vision, web classification, and neuroscience. Prior to joining Kaggle, he applied machine learning to improve brain-computer interfaces as a Whitaker Fellow at the École Polytechnique Fédérale de Lausanne in Lausanne, Switzerland. He graduated with a BSE in Biomedical Engineering, Electrical Engineering, and Math from Duke University.

Comments on this page are now closed.


William Cukierski
02/27/2013 2:45am PST

We’ll post code and slides here: https://www.kaggle.com/c/just-the-basics-strata-2013/forums/t/3939/code-and-slides

Amanda Baker
02/26/2013 8:07am PST

Will you be posting all R, Matlab, and iPython Notebook code here? I’d highly appreciate it. Thank you!

Phillip Burger
02/26/2013 3:05am PST

Regarding the number of estimators used in a Random Forest model, starting with 100 is usual. Keeping all other model parameters constant, do you just increase and decrease the number of estimators and re-run the model to find the best value of n_estimators?

What is the interaction between n_estimators and the other model parameters? Will the optimum value of n_estimators determined above change once we start tuning the other parameters in the model? After tuning the n_estimators parameter, what do you suggest as the next two parameters to tune? Can you tune in a linear fashion like this? Or is the interaction of the parameters too complex to stick with one disciplined approach to tuning?
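
One way to explore the question above is to compare one-at-a-time tuning with a joint grid search over the same candidate values. The sketch below is only an illustration under stated assumptions: toy_score is a made-up function standing in for a cross-validated score, not a measurement of any real random forest, and max_features is chosen here as a second parameter for the example (it is not named in the comment). The helper names tune_one_at_a_time and tune_jointly are likewise hypothetical.

```python
import itertools


def tune_one_at_a_time(score, grid, order):
    """Tune each parameter in `order` while holding the others at their current value."""
    current = {name: values[0] for name, values in grid.items()}
    for name in order:
        best_value = max(grid[name], key=lambda v: score(dict(current, **{name: v})))
        current[name] = best_value
    return current, score(current)


def tune_jointly(score, grid):
    """Score every combination of candidate values and keep the best one."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, combo))
        for combo in itertools.product(*(grid[name] for name in names))
    ]
    best = max(candidates, key=score)
    return best, score(best)


def toy_score(params):
    """Made-up stand-in for a cross-validated score, with a deliberate interaction."""
    n, m = params["n_estimators"], params["max_features"]
    preferred_m = 2 if n <= 100 else 8  # assumption: the best m shifts with n
    return 0.7 + min(n, 200) / 2000.0 - 0.03 * abs(m - preferred_m)


grid = {"n_estimators": [10, 100, 500], "max_features": [2, 4, 8]}

one_params, one_score = tune_one_at_a_time(
    toy_score, grid, order=["n_estimators", "max_features"]
)
joint_params, joint_score = tune_jointly(toy_score, grid)

print("one at a time:", one_params, one_score)
print("joint grid:   ", joint_params, joint_score)
```

The joint search can never score worse than the one-at-a-time search over the same grid, since it evaluates every combination the latter could reach. Whether the extra cost is worth it depends on how strongly the parameters interact for a given model and dataset, which the toy function here does not tell you.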

Ronald Karunia
02/22/2013 9:15am PST

How can I register for this tutorial only?

Sophia DeMartini
02/19/2013 10:54am PST

Hi Siyun,

You just need basic programming ability to attend this tutorial. If you do attend, just make sure to come prepared with the items listed above.

Siyun Fan
02/19/2013 10:05am PST

Hi, I am wondering what the prerequisites are for this tutorial, specifically the required R and Python coding skill level. Thanks!

Kathy Yu
11/14/2012 4:19am PST

You can sign up for the Strata Newsletter to get news and updates (including updates when the video compilation becomes available).

Kathy Yu
11/14/2012 4:09am PST

Hi Miquel – thanks for your interest in this session. We record all tutorials, sessions, and keynotes (pending speaker consent) as a part of the Complete Video Compilation, available for sale a few weeks after the event.

Miquel Llobet
11/13/2012 8:50am PST

First of all, many thanks for organising such an event. It looks amazing for a newbie like myself, and I can imagine the work that goes into this.

Is a video from the conference going to be uploaded? I am really interested in attending but it’s impossible for me as I’m not studying in the US :( It would be great so more people could take advantage of it!

Thanks in advance!

