A data scientist usually starts with some hypothesis about what should go into a predictive model. Since a model is a simplified representation of the world, we can never conclusively know that a variable causes some effect. The data scientist can only try to distinguish correlation from causation by measuring a variable’s predictive power across a long time span, perhaps even across the whole lifetime of the project.
In reality, a hypothesis about important variables is shaped by the technical realities of the organization. Whatever data the organization already collects is usually the place to start looking for predictive power. Something buried away in the Hadoop cluster must be correlated with customer churn, right?! The data scientist also reviews the academic literature to find data worth her attention: which variables turn up in a researcher’s toy model? Or, on that well-capitalized dream project, she considers third-party sources: does anyone have a chunk of data that predicts our churn? The data scientist inevitably arrives at the point where the extent of a variable’s usefulness needs to be measured. In other words, how do we choose which variables to include in our model?
The big data community stands on the shoulders of several giants, so it is fitting that there are many different approaches to estimating variable importance. A traditional statistician might calculate p-values, while a time-series modeller measures the improvement in blind, out-of-sample forecasting accuracy, and a machine learning expert may use Shannon entropy to calculate a variable’s information gain. Each of these measures makes assumptions about the process being modeled. For example, the data scientist with a time-series bent must decide the minimum amount of hold-out data that still represents the population. All this nuance in variable importance is worth negotiating, because feature selection has a multiplicative effect on the overall modeling process: good variable importance measurement and feature selection pay off at every subsequent stage.
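To make the information-gain measure concrete, here is a minimal sketch of the entropy-based calculation mentioned above. The data and the `used_support` flag are invented purely for illustration; they are not from the talk or from Altos Research.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# Hypothetical data: does a "used support" flag help predict churn?
churn = np.array([1, 1, 0, 0, 1, 0, 0, 0])
used_support = np.array([1, 1, 0, 0, 1, 0, 1, 0])
print(information_gain(used_support, churn))
```

A gain of zero would mean the flag tells us nothing about churn; a gain equal to the full label entropy would mean it determines churn exactly.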
Feature selection is something of an open secret. An interview with a candidate for a data scientist role is probably going well if you spend most of the time talking about variable importance, and only touch on the “secret code” of a particular machine learning algorithm at the interview’s end.
Measuring the predictive power of a variable is useful even outside the context of training a predictive model. If three variables sampled every day have done a decent job of predicting customer churn, an organization might spend the money to sample those variables every hour or every minute. This is an example of data ecology and data feedback: measuring variable importance pushes the organization to change how it collects and analyzes its data. Given enough time, this feedback loop optimizes how the organization as a whole approaches data.
My presentation introduces feature selection in the context of real-life trade-offs. I will briefly cover several variable importance measures, and discuss the difference between variable reduction (regularization) and variable selection. We will also contrast on-the-fly variable importance (e.g., LARS) with a separate selection pass. Principal components analysis will be critiqued as a “black box” approach to dimensionality reduction, one that fails to provide useful feedback to the organization. Since important variables are often useful when clustering, I will also touch on unsupervised learning. Throughout the presentation I will refer to real-life use cases from my work with Altos Research and our forecasting model of residential real estate.
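The regularization-versus-selection contrast above can be sketched in a few lines. This is a hedged illustration using scikit-learn, which the abstract does not mention, on synthetic data: a Lasso fit shrinks uninformative coefficients toward zero on the fly, while a separate selection pass ranks variables first and then fits an unpenalized model on the survivors. The correlation threshold is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
# Only the first two columns actually drive the response.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Variable reduction via regularization: the L1 penalty shrinks
# unhelpful coefficients toward (often exactly) zero during fitting.
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefs:", np.round(lasso.coef_, 2))

# Separate selection pass: rank variables by |correlation| with y,
# keep the strong ones, then fit an unpenalized model on the survivors.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = corr > 0.2  # arbitrary threshold for illustration
ols = LinearRegression().fit(X[:, keep], y)
print("kept columns:", np.flatnonzero(keep), "coefs:", np.round(ols.coef_, 2))
```

Note the trade-off the talk gestures at: the selection pass produces an explicit, inspectable list of kept variables (useful feedback to the organization), while the regularized fit folds selection into training.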
Ben was a professional software developer for ten years, and a BBS scenester in the mid-nineties. He is also one of those annoying former quants. Ben’s past clients include investment banks like JPMorgan Chase and Credit Suisse, the hedge fund Natura Capital, and EdF Trading, an energy trading house. He built a taxonomy browser for Encyclopaedia Britannica in 2004, and previously worked for ThoughtWorks as a convert to agile software engineering.
Ben teaches and speaks on machine learning, software engineering, financial analysis, and the culture of quants. While living in London, Ben was an early contributor to the grassroots cartography project OpenStreetMap. He continues to manage a portfolio of financial assets via a quantitative trading strategy built upon sentiment and predictive analytics. He has an MSc in Finance from London Business School and a BEng in Computer Science from Northwestern University.