We are awash in a diversity of data, which lets us build new, highly differentiated, dynamic products and services around analyzing that data for users. But this ubiquity also creates problems. Many data sets are too controversial for all users to rely on, so users need to be able to opt out. And combining data from different sources into a single model using standard techniques can be so difficult that the resulting R&D cycles make integrating novel data sources into one model unfeasible. We need a new paradigm for model generation that lets us integrate new data rapidly and simply, without growing the complexity of our models, and a new paradigm for giving users control of the data sources behind their outsourced analytics.
Ensemble models provide a solution to this problem. They assign the same problem to many different models and summarize the results into a single value, much as we've learned to do with the collective intelligence of users, creating a new collective artificial intelligence. By building many discrete models on discrete data sources and combining them at a higher level, we can control the complexity of modeling real-world data problems, introduce new data sources quickly and easily, and provide a more accurate view of the world when many legitimate versions of the truth exist.
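The combining step above can be sketched very simply: several independent models, each built on its own data source, answer the same question, and their answers are averaged into one value. This is a minimal illustration, not the talk's actual method, and all model names and numbers here are hypothetical stand-ins for a home-valuation problem.

```python
def tax_record_model(features):
    # Stand-in for a model trained on public tax-assessment data.
    return features["assessed_value"] * 1.10

def listing_model(features):
    # Stand-in for a model trained on current listing prices.
    return features["sqft"] * features["price_per_sqft"]

def comps_model(features):
    # Stand-in for a model trained on comparable recent sales.
    return sum(features["comp_sales"]) / len(features["comp_sales"])

def ensemble_estimate(features, models):
    """Combine discrete models by averaging their predictions."""
    predictions = [model(features) for model in models]
    return sum(predictions) / len(predictions)

home = {
    "assessed_value": 300_000,
    "sqft": 1_500,
    "price_per_sqft": 220,
    "comp_sales": [310_000, 335_000, 342_000],
}

estimate = ensemble_estimate(
    home, [tax_record_model, listing_model, comps_model]
)
```

A simple average is only one summarizing strategy; weighted averages or a higher-level model trained on the individual predictions are common alternatives, and adding a new data source just means appending one more model to the list.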
Additionally, these models have many advantages as products. The research and development effort to assimilate data and make effective predictions from it can be carried out in parallel, by teams with no knowledge of the other data sources and modeling approaches, rather than as a monolithic task. Consumers of the model can easily drop models and approaches in and out of the larger calculation, and the model can degrade gracefully when presented with lesser data. By saving ourselves effort, we can create more engaging, powerful, and customizable analytics for users: no two users need agree on what data and approach should be used in their analytics.
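The opt-out and graceful-degradation properties can be sketched as follows: models live in a registry, a user can disable any source they object to, and a model that cannot score a record (because the needed data is missing) simply abstains. All names and numbers are illustrative assumptions, not part of the original talk.

```python
def crime_stats_model(record):
    # Abstain when the record lacks the data this source needs.
    if "crime_index" not in record:
        return None
    return 500_000 - record["crime_index"] * 1_000

def school_ratings_model(record):
    if "school_score" not in record:
        return None
    return 250_000 + record["school_score"] * 20_000

# Registry of available data sources / models.
registry = {
    "crime_stats": crime_stats_model,
    "school_ratings": school_ratings_model,
}

def estimate(record, disabled=()):
    """Average only the enabled models that can score this record."""
    votes = [
        model(record)
        for name, model in registry.items()
        if name not in disabled
    ]
    votes = [v for v in votes if v is not None]  # drop abstentions
    if not votes:
        raise ValueError("no model could score this record")
    return sum(votes) / len(votes)

record = {"crime_index": 40, "school_score": 8}
full = estimate(record)                                  # all sources vote
opted_out = estimate(record, disabled={"crime_stats"})   # user opt-out
```

The same mechanism handles both cases: a user opting out of a controversial data set and a record that arrives with lesser data both just reduce the set of votes, and the ensemble keeps producing an answer from whatever remains.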
We’ll go over basic ensemble model methods, talk about how they can be applied to estimating the value of real estate, determining the value of a stock portfolio, and the winning solution to the Netflix Prize, and discuss ways in which these ideas can be extended into your business.
Jesper Andersen is the Product Manager for Data and Econometrics at Trulia.com, and a statistician, computer scientist, and entrepreneur. He has spoken internationally on finance and statistical systems and is the co-founder of Freerisk.org, a startup focused on providing transparent and diverse financial metrics. Previously he was the lead architect at Visible Path, which was sold in 2008.
Jesper holds a B.Sc. in Physics from Haverford College and an M.B.A. from University of Chicago’s Booth School of Business, where he received the Vijay Vashee “Most Promising Entrepreneur” award in 2008.