Journey or Destination: Using Models to Explore Big Data

Ben Gimpert (Altos Research)
Location: Sutton South
Level: Intermediate

Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.

This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.

A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias-variance trade-off. Put more simply, applied business knowledge is what justifies assuming that a model trained on historical data will do decently in situations we have never seen.
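The gap between in-sample and out-of-sample accuracy can be made concrete with a small sketch. Everything in it is an assumption for illustration, not material from the talk or from Altos Research: the data are synthetic (a straight line plus alternating noise), and `fit_memorizer` and `fit_line` are hypothetical helper names. A model that memorizes its training data scores perfectly in-sample, yet a simpler, more biased line does better on held-out points.

```python
def fit_memorizer(xs, ys):
    """High-variance model: return the y of the nearest training x (1-nearest-neighbour)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_line(xs, ys):
    """Higher-bias model: ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a model over a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic "historical" data: y = 2x plus deterministic alternating noise of +1 / -1.
train_x = [float(x) for x in range(10)]
train_y = [2 * x + (1 if int(x) % 2 == 0 else -1) for x in train_x]

# Synthetic "unseen" situations: points between the training x values, without noise.
holdout_x = [x + 0.4 for x in range(9)]
holdout_y = [2 * x for x in holdout_x]

memorizer = fit_memorizer(train_x, train_y)
line = fit_line(train_x, train_y)

memorizer_train = mse(memorizer, train_x, train_y)
line_train = mse(line, train_x, train_y)
memorizer_holdout = mse(memorizer, holdout_x, holdout_y)
line_holdout = mse(line, holdout_x, holdout_y)

print("in-sample  MSE  memorizer:", memorizer_train, " line:", line_train)
print("held-out   MSE  memorizer:", memorizer_holdout, " line:", line_holdout)
```

Which of the two errors matters, and how much of each is tolerable, is the business question that the code alone cannot answer.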

Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
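One common entropy-based way to rank predictors is information gain: how much knowing a predictor reduces the entropy of the outcome. The sketch below is a minimal illustration under stated assumptions, not the method used in the case study. The toy listings, the field names (`school_rating`, `has_pool`, `price_rose`), and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of categorical labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target):
    """Reduction in target entropy after splitting the rows on one feature."""
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def rank_predictors(rows, features, target):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = {f: information_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, invented purely for illustration.
listings = [
    {"school_rating": "high", "has_pool": "yes", "price_rose": "yes"},
    {"school_rating": "high", "has_pool": "no", "price_rose": "yes"},
    {"school_rating": "low", "has_pool": "yes", "price_rose": "no"},
    {"school_rating": "low", "has_pool": "no", "price_rose": "no"},
]

ranking = rank_predictors(listings, ["school_rating", "has_pool"], "price_rose")
for feature, gain in ranking:
    print(f"{feature}: {gain:.3f} bits")
```

In this toy data `school_rating` fully determines the outcome and `has_pool` carries no information, so the ranking is unambiguous. On real data the ranking is a prompt for a conversation with domain experts, since correlated predictors (multicollinearity) can share or mask each other's importance.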

My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.


Ben Gimpert

Altos Research

Ben was a professional software developer for ten years, and has been hacking code for much longer. His past clients include investment banks like JPMorgan Chase and Credit Suisse, the hedge fund Natura Capital, and EdF Trading, an energy trading house. He built a taxonomy browser for Encyclopaedia Britannica in 2004, and previously worked for ThoughtWorks as a convert to agile software engineering.

Ben teaches and speaks on machine learning, software engineering, financial analysis, and the culture of quants. While living in London, Ben was an early contributor to the grassroots cartography project OpenStreetMap. He continues to manage a portfolio of financial assets via a quantitative trading strategy built upon sentiment and predictive analytics. He has an MSc in Finance from London Business School and a BEng in Computer Science from Northwestern University.


