Brought to you by NumFOCUS Foundation and O’Reilly Media
The official Jupyter Conference
Aug 21-22, 2018: Training
Aug 22-24, 2018: Tutorials & Conference
New York, NY

Predicting Abatement: Modeling a Small Dataset with High Variance

Moderated by: Sonyah Seiden

This project explores how to model a small dataset with high variance. Using metrics of production and emissions trends, the target (abatement in metric tons of CO2) was constructed and passed through three regression models to test different approaches to incorporating variance into the modeling process.

The data spanned 27 countries over 25 years, and each model was trained and tested on historical data with 15 lags. The training models aimed to predict abatement from 2005-2013, and the test models predicted 2014. The models tested were a traditional autoregressive model, scikit-learn's Random Forest Regressor, and Bayesian autoregression with PyMC3. Before modeling, the data was passed through K-Means clustering to develop cohorts.
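The two preprocessing steps described above — building 15-lag train/test matrices and clustering countries into cohorts — can be sketched roughly as follows. All data here is synthetic, and `make_lagged` and the cluster count are illustrative assumptions, not the talk's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical panel: 27 countries x 25 years of abatement values.
n_countries, n_years, n_lags = 27, 25, 15
panel = rng.normal(size=(n_countries, n_years))

def make_lagged(series, n_lags):
    """Stack n_lags past values as features for each target year."""
    X = np.column_stack(
        [series[i : len(series) - n_lags + i] for i in range(n_lags)]
    )
    y = series[n_lags:]
    return X, y

# One country's lagged design matrix: 10 target years, 15 lag features.
X0, y0 = make_lagged(panel[0], n_lags)

# Cluster countries on their full trajectories to form modeling cohorts.
cohorts = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(panel)
```

With 25 years and 15 lags, each country contributes only 10 usable target years, which illustrates why the dataset is small relative to its variance.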

A primary focus of this project is to assess what Bayesian regression can offer in terms of incorporating distributions into future predictions. Overall, the Bayesian model performed better than the traditional autoregressive model, but in some cases not as well as the Random Forest Regressor. The traditional autoregressive model served as a baseline, while the Random Forest Regressor was included to test the efficacy of an ensemble method on difficult predictions. The Bayesian model is distinguished from the other regressions by incorporating variance through informed prior distributions.
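The idea of folding variance into predictions through informed priors can be illustrated with a minimal stand-in. The talk uses MCMC via PyMC3; the sketch below substitutes a conjugate-normal Bayesian linear regression in plain numpy so it is self-contained, with synthetic data, an assumed known noise level, and a placeholder prior (the project derived its priors from per-country MCMC runs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lagged design: 10 training years, 15 lag features.
X = rng.normal(size=(10, 15))
w_true = rng.normal(size=15)
y = X @ w_true + rng.normal(scale=0.1, size=10)

# Prior on the lag coefficients (zero-centered here for illustration).
sigma2 = 0.1 ** 2          # assumed known observation noise
m0 = np.zeros(15)          # prior mean
S0_inv = np.eye(15)        # prior precision

# Conjugate posterior for Bayesian linear regression with known noise.
Sn = np.linalg.inv(S0_inv + X.T @ X / sigma2)
mn = Sn @ (S0_inv @ m0 + X.T @ y / sigma2)

# 2,000 posterior draws yield a predictive *distribution* per target
# year, rather than a single point estimate.
w_draws = rng.multivariate_normal(mn, Sn, size=2000)
x_new = rng.normal(size=15)
pred_dist = w_draws @ x_new
point = pred_dist.mean()
```

The key contrast with the AR baseline and the Random Forest is visible in `pred_dist`: the model's uncertainty survives into the forecast instead of being collapsed away.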

The traditional autoregressive model was often able to predict increases in abatement and capture the peaks in trends, but it often lost integrity when predicting troughs. The best Bayesian model was built for Cohort 1, which included the largest group of countries (topping out at 25 of the 35 countries). It scored an R² of 0.8 on train and 0.64 on test, a marked improvement over the traditional regression's scores of 1 on train and -1.18 on test. Across all cohorts, the Random Forest Regressor and the traditional autoregressive model performed best; however, the two cohorts modeled with Bayesian autoregression had divergent results, making them excellent subsets for exploring what kinds of data Bayesian methods are applicable to, as well as how they can be fine-tuned.
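The baseline-versus-ensemble comparison behind these scores can be sketched on a lagged design like the ones above. Everything here is synthetic and simplified (a linear fit on lags stands in for the AR baseline, and a single held-out row stands in for the 2014 test year):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)

# Hypothetical lagged data; last row held out as the test year (2014).
X = rng.normal(size=(10, 15))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=10)
X_train, y_train = X[:-1], y[:-1]
X_test = X[-1:]

# AR-style baseline: linear regression on the lag features.
ar = LinearRegression().fit(X_train, y_train)
# Ensemble comparison: Random Forest on the same lagged features.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    X_train, y_train
)

train_r2 = r2_score(y_train, ar.predict(X_train))
ar_pred, rf_pred = ar.predict(X_test), rf.predict(X_test)
```

With 15 features and only 9 training rows, the linear baseline can fit the training data near-perfectly, mirroring the train R² of 1 (and negative test R²) reported for the traditional regression above.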

This project entailed building a series of loops that generated lagged train and test datasets, trained models and generated predictions for each cohort, and returned usable markdown tables. For the Bayesian autoregression, this meant developing informed priors by passing individual countries through MCMC, generating 2,000 predictions for each target, selecting from the resulting distributions, and "ensembling" them to determine R². It also demanded meticulous organization and variable management within the code. Each notebook stands alone and runs through independently.
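The final "ensembling" step — collapsing 2,000 sampled predictions per target into one score — might look like the following. The draws here are synthetic, and the posterior mean is one assumed choice of point estimate (medians or selected quantiles are alternatives):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)

# Hypothetical posterior predictive draws: 2,000 samples for each of
# nine target years (e.g. 2005-2013).
y_true = rng.normal(size=9)
draws = y_true + rng.normal(scale=0.2, size=(2000, 9))

# "Ensemble" the sampled predictions into one point estimate per
# target, then score against the observed values.
y_hat = draws.mean(axis=0)
score = r2_score(y_true, y_hat)
```

Because each target keeps its full predictive distribution until this last step, the same `draws` array can also report intervals alongside the single R² number.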