How to do Predictive Analytics with Limited Data

Ulrich Rueckert (Datameer)
Data Science Beekman Parlor - Sutton North

It is frustrating: even with petabytes of data on a Hadoop cluster, you can still lack the key data that a big data analytics use case depends on. You might have billions of clicks on your web site, but only a few users choose to rate a product. There might be millions of text documents on your cluster, but it is too expensive to have someone categorize more than a tiny fraction of them. In principle, this is where predictive modeling could help. For instance, you could learn a model that predicts user ratings and then serve better product recommendations based on those expected ratings. Or you could build a model that categorizes text documents automatically, saving countless hours and dollars. The main problem is that only a limited amount of labeled training data (i.e., user ratings or categorized documents) is available, which makes it hard to train good models.

As it turns out, recent machine learning research offers a way to deal effectively with such situations: semi-supervised learning. Semi-supervised techniques can often leverage the vast amount of related but unlabeled data to produce accurate models. In this talk, we will give an overview of the most common techniques, including co-training and regularization. We first explain the principles and underlying assumptions of semi-supervised learning and then show how to implement such methods with Hadoop. Finally, we explain best practices and illustrate them with a demo.
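To make the co-training idea concrete, here is a minimal single-machine sketch in Python. It is not the Hadoop implementation presented in the talk, which this page does not show. It assumes each example has two numeric "views" and uses a simple nearest-centroid classifier per view; the function names (`fit_centroids`, `predict`, `co_train`) and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Nearest-centroid model on one numeric feature: class -> mean value."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in sorted(set(ys))}


def predict(centroids, x):
    """Return (label, confidence). Confidence is the distance gap between
    the closest and second-closest centroid. Needs at least two classes."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(x - centroids[runner_up]) - abs(x - centroids[best])


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b).
    In each round, the model for each view labels the pool examples it is most
    confident about; those examples then join the shared training set, so each
    view can teach the other."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = fit_centroids([x[view] for x, _ in labeled],
                                  [y for _, y in labeled])
            # Most confident predictions under this view come first.
            ranked = sorted(pool, key=lambda x: predict(model, x[view])[1],
                            reverse=True)
            for x in ranked[:per_round]:
                labeled.append((x, predict(model, x[view])[0]))
                pool.remove(x)
    models = [fit_centroids([x[v] for x, _ in labeled], [y for _, y in labeled])
              for v in (0, 1)]
    return models, labeled


# Toy data (made up): two labeled examples and six unlabeled ones.
seed = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]
pool = [(0.1, 0.4), (0.85, 0.6), (0.45, 0.05),
        (0.55, 0.9), (0.2, 0.25), (0.8, 0.7)]
models, grown = co_train(seed, pool)
```

In this toy run, the example `(0.45, 0.05)` is ambiguous in the first view but clearly "neg" in the second, so the second view's model labels it and the first view's model benefits from the extra training point. That exchange relies on the usual co-training assumption that each view is informative enough on its own to classify some examples confidently.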

Ulrich Rueckert


Ulrich Rueckert is a Data Scientist at Datameer. Prior to Datameer, he worked as a research scholar at UC Berkeley and the International Computer Science Institute. His research on machine learning and data mining has been published in renowned journals and has won awards at international conferences. Ulrich serves on the program committees of the main machine learning conferences, and he has organized workshops and held tutorials on his research.

Comments on this page are now closed.


Marek K Kolodziej
10/30/2013 4:32pm EDT

Would it be possible to post the slides here, like the other speakers have?


Contact Us

View a complete list of Strata + Hadoop World 2013 contacts