Predictive Modeling in the Cloud with Scikit-learn and IPython

Olivier Grisel (INRIA)
Data Science
Ballroom AB

IPython, with its notebook interface, is an interactive programming environment that is particularly well suited for data exploration, modelling, and sharing of analysis results, notably via

Scikit-learn is a versatile machine learning library for Python that blends well with the NumPy and SciPy ecosystem and is used by a growing user base of academic researchers as well as data scientists and engineers in the tech industry.

Together, the two projects offer a productive environment for building and evaluating predictive models from data. In particular, IPython's distributed computing capabilities make it possible to offload computationally intensive machine learning tasks to clusters of tens or hundreds of nodes without breaking the interactive experience.

The goal of the presentation is to showcase how to set up an ad hoc data modelling environment using a cluster provisioned in a public cloud, and how to use it to perform common predictive modelling operations such as:

  • cross-validated model assessment and automated search for the best parameters for common feature extraction and machine learning algorithms,
  • parallel training of out-of-core text classification models for sentiment analysis,
  • parallel training of large randomized ensembles of decision trees (a.k.a. Random Forests).
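The three operations listed above can be sketched on a single machine as follows. This is a minimal illustration, not the material from the talk: it assumes a recent scikit-learn release (one that provides sklearn.model_selection), uses synthetic data and a tiny made-up text sample, and relies on local cores through n_jobs rather than on an IPython cluster.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset (an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 1. Cross-validated search for the best parameter value.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_

# 2. Out-of-core text classification: a stateless hashing vectorizer plus
#    incremental learning with partial_fit, one mini-batch at a time.
#    The batches below are hypothetical; real ones would be streamed from disk.
vectorizer = HashingVectorizer(n_features=2 ** 10)
text_clf = SGDClassifier(random_state=0)
batches = [
    (["great movie", "loved it"], [1, 1]),
    (["terrible plot", "hated it"], [0, 0]),
]
for texts, labels in batches:
    text_clf.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])
prediction = text_clf.predict(vectorizer.transform(["loved it"]))

# 3. Random forest trained in parallel on all local cores (n_jobs=-1).
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)
```

Because HashingVectorizer keeps no fitted vocabulary, each batch can be vectorized independently, which is what makes the second step suitable for data that does not fit in memory. Spreading this work over cluster nodes would need additional tooling that this sketch does not cover.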

Olivier Grisel

Software Engineer, INRIA

Olivier Grisel is a software engineer in the Parietal team at INRIA. He works to improve the speed and scalability of the scikit-learn machine learning library for the Python / NumPy / SciPy ecosystem. He also likes to share interesting machine learning papers and tricks on Twitter: @ogrisel