Best Practices for Building and Deploying Predictive Models over Big Data

Business & Industry Data Science, Grand East (NY Hilton)
Tutorial. Please note: to attend, your registration must include Tutorials.

In this tutorial, we show how open source tools can be used for the entire life cycle of a predictive model built over big data. Specifically, for anyone who has built a model, we show how to: 1) perform an exploratory data analysis (EDA) of data managed by Hadoop using R and other open source tools; 2) leverage the EDA to build analytic and statistical models over data managed by Hadoop; 3) deploy these models into operational systems; and 4) measure the performance of the models and continuously improve them.

We cover the following topics:

  • Three simple techniques for exploratory data analysis (EDA) over Hadoop
  • Four ways to interoperate Hadoop and R, including RHIPE and R+Hadoop
  • Building analytic models over Hadoop using R and other open source tools
  • Why you should use multiple models (segmented models and ensembles of models) when building models over Hadoop
  • Languages for describing predictive models, including the Predictive Model Markup Language
  • Model producers and model consumers (scoring engines)
  • Integrating scoring engines into operational systems
  • Evaluating the effectiveness of a model
  • The continuous improvement of a model
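
The split between model producers and model consumers, combined with segmented models, can be illustrated with a minimal sketch. It is not the presenters' code. The assumptions are loud ones: the per-segment "model" is just a mean, a JSON document stands in for a model description language such as PMML (it is not PMML), and the names `produce_model`, `score`, `region`, and `spend` are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


def produce_model(records, segment_key, target_key):
    """Model producer: fit one trivial model (the segment mean) per segment.

    Returns a JSON string. JSON is only a stand-in here for a real model
    description language such as PMML.
    """
    by_segment = defaultdict(list)
    for rec in records:
        by_segment[rec[segment_key]].append(rec[target_key])
    model = {
        "segment_key": segment_key,
        "segments": {seg: mean(vals) for seg, vals in by_segment.items()},
        # Fallback prediction for segments never seen during training.
        "default": mean(rec[target_key] for rec in records),
    }
    return json.dumps(model)


def score(model_doc, record):
    """Model consumer (scoring engine): needs only the serialized model."""
    model = json.loads(model_doc)
    segment = record.get(model["segment_key"])
    return model["segments"].get(segment, model["default"])


# Hypothetical training data: one segment per region.
training = [
    {"region": "east", "spend": 10.0},
    {"region": "east", "spend": 14.0},
    {"region": "west", "spend": 30.0},
]

model_doc = produce_model(training, "region", "spend")
print(score(model_doc, {"region": "east"}))   # mean of the "east" segment
print(score(model_doc, {"region": "north"}))  # unseen segment, uses the default
```

Because the consumer reads only the serialized document, the model can be rebuilt and swapped without changing the scoring code, which is the property that makes integration into operational systems and continuous improvement practical.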

Robert Grossman

Open Data Group

Robert Grossman (@bobgrossman) is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is a Core Faculty member and Senior Fellow at the Institute for Genomics and Systems Biology (IGSB) and the Computation Institute (CI) at the University of Chicago. He has led the development of new open source software tools for analyzing big data, cloud computing, data mining, distributed computing, and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc. in 1996, which provides data mining solutions to the insurance industry. Grossman was Magnify’s CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs about big data, data science, and data engineering at

Collin Bennett

Open Data Group

Collin Bennett is a principal at Open Data Group. In three and a half years with the company, Collin has worked on the open source Augustus scoring engine and a cloud-based environment for rapid analytic prototyping called RAP. Additionally, he has released open source projects for the Open Cloud Consortium. One of these, MalGen, has been used to benchmark several parallel computation frameworks. Previously, he led software development for the Product Development Team at Acquity Group, an IT consulting firm headquartered in Chicago. He also worked at Orbitz (when it was still a startup) and Business Logic Corporation. He has co-authored papers on Weyl tensors, large data clouds, and high performance wide area cloud testbeds. He holds degrees in English, Mathematics, and Computer Science.

Comments on this page are now closed.


Robert Grossman
10/24/2012 12:43pm EDT

I just uploaded them. You should be able to see them shortly. You can also find the slides, along with some additional material and background information about what was presented at

Vladimir Korolev
10/24/2012 4:59am EDT

Great session. Are you going to post the slides here?

