Tag: random forest

Forecasting with Random Forests

When it comes to forecasting data (time series or other types of series), people look to things like basic regression, ARIMA, ARMA, GARCH, or even Prophet. But don’t discount the use of Random Forests for forecasting data.

Random Forests are generally considered a classification technique, but they can definitely handle regression as well.
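As a quick illustration that a random forest can predict a continuous number rather than a class label, here’s a minimal sketch on synthetic data. The data, the seed, and the variable names are all made up for this example and have nothing to do with the housing dataset used below.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data (an assumption for illustration): y is a noisy
# linear function of a single feature x
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)

# sklearn expects a 2-D feature array of shape (n_samples, n_features)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(x.reshape(-1, 1), y)

# the prediction is a continuous value, not a class label
prediction = model.predict(np.array([[5.0]]))
print(prediction)
```

The prediction for x = 5 should land somewhere near 15, since that is what the underlying made-up relationship gives before noise.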

For this post, I am going to use a dataset found here called Sales Prices of Houses in the City of Windsor (CSV here, description here).  For the purposes of this post, I’ll only use the price and lotsize columns. Note: In a future post, I’m planning to revisit this data and perform multivariate regression with Random Forests.

To get started, let’s import all the necessary libraries. As always, you can grab a Jupyter notebook to run through this analysis yourself here.

import pandas as pd
import matplotlib.pyplot as plt
# let's set the figure size and color scheme for plots
# (personal preference and not needed).
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Now, let’s load the data:

df = pd.read_csv('../examples/Housing.csv')
df = df[['price', 'lotsize']]

Again, we are only using two columns from the data set – price and lotsize. Let’s plot this data to take a look at it visually to see if it makes sense to use lotsize as a predictor of price.

df.plot(subplots=True)

Housing Data Visualization

Looking at the data, it looks like a decent guess to think lotsize might forecast price.

Now, let’s set up our dataset to get our training and testing data ready.

# df is the dataframe we loaded above
X = df['lotsize']
y = df['price']

# first 400 rows for training, the remaining rows for testing
X_train = X[X.index < 400]
y_train = y[y.index < 400]

X_test = X[X.index >= 400]
y_test = y[y.index >= 400]

In the above, we set X and y for the random forest regressor and then set our training and test data. For training data, we are going to take the first 400 data points to train the random forest and then test it on the last 146 data points.
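The split above relies on the dataframe having the default 0, 1, 2, … index. As a rough sketch of the same idea that doesn’t depend on the index labels, you can slice by position with iloc. The demo frame below is a stand-in with placeholder values (not the real housing data), sized to match the 400 + 146 rows mentioned above.

```python
import pandas as pd

# stand-in frame with 546 rows; the values are placeholders,
# not real lot sizes or prices
demo = pd.DataFrame({'lotsize': range(546), 'price': range(546)})

split = 400
train = demo.iloc[:split]   # first 400 rows by position
test = demo.iloc[split:]    # everything after that

print(len(train), len(test))
```

Keeping the rows in their original order (rather than shuffling) matters if you treat the data as a series, since the test rows then always come after the training rows.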

Now, let’s run our random forest regression model.  First, we need to import the Random Forest Regressor from sklearn:

from sklearn.ensemble import RandomForestRegressor

And now….let’s run our Random Forest Regression and see what we get.

# build our RF model
RF_Model = RandomForestRegressor(n_estimators=100,
                                 max_features=1, oob_score=True)
# let's get the labels and features in order to run our
# model fitting. sklearn wants a 2-D feature array, so we
# convert the lotsize Series to numpy and add a column axis.
labels = y_train
features = X_train.to_numpy()[:, None]
# Fit the RF model with features and labels.
rgr = RF_Model.fit(features, labels)
# Now that we've fit our model, let's create
# dataframes to look at the results
X_test_predict = pd.DataFrame(
    rgr.predict(X_test.to_numpy()[:, None])).rename(
    columns={0: 'predicted_price'}).set_index('predicted_price')
X_train_predict = pd.DataFrame(
    rgr.predict(features)).rename(
    columns={0: 'predicted_price'}).set_index('predicted_price')
# combine the training and testing dataframes to visualize
# and compare (DataFrame.append was removed in pandas 2.0).
RF_predict = pd.concat([X_train_predict, X_test_predict])
# attach the predictions to df; the rows are still in their
# original order, so they line up with df's default index
df['predicted_price'] = RF_predict.reset_index()['predicted_price']

Let’s visualize the price and the predicted_price.

df[['price', 'predicted_price']].plot()

price vs predicted price

That’s really not a bad outcome for a wild guess that lotsize predicts price. Visually, it looks pretty good (although there are definitely errors).

Let’s look at the base level error. First, a quick plot of the ‘difference’ between the two.

df['diff']=df.predicted_price - df.price
df['diff'].plot(kind='bar')

Price vs Predicted Difference

There are some very large errors in there.  Let’s look at some values like R-Squared and Mean Squared Error. First, let’s import the appropriate functions from sklearn.

from sklearn.metrics import r2_score

Now, let’s look at R-Squared:

RSquared = r2_score(y_train, X_train_predict.reset_index().values)

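R-Squared is 0.6976…or basically 0.7.  That’s not great but not terribly bad either for a random guess. A value of 0.7 (or 70%) tells you that roughly 70% of the variation of the ‘signal’ is explained by the model built on the variable used as a predictor. Keep in mind that this score is calculated on the training rows, so it describes how well the model fits data it has already seen; the score on the 146 held-out rows may well be lower. Still, that’s really not bad in the grand scheme of things.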
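To make the R-Squared number a little less of a black box, here’s a small sketch of the calculation by hand: one minus the ratio of the unexplained variation to the total variation. The actual and predicted arrays are made-up values for illustration only, not output from the housing model.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up actual and predicted values, for illustration only
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 10.0])

# variation left unexplained by the predictions
ss_res = np.sum((actual - predicted) ** 2)
# total variation of the actual values around their mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

# should agree with sklearn's r2_score
print(r2_manual, r2_score(actual, predicted))
```

To score the held-out rows instead of the training rows, the same r2_score call can be made with y_test and the test predictions.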

I could go on with other calculations for error but the point of this post isn’t to show ‘accuracy’ but to show ‘process’ on how to use Random Forest for forecasting.
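If you do want a few more error numbers, here’s a minimal sketch of mean squared error, root mean squared error, and mean absolute error using sklearn.metrics. The arrays are made-up values for illustration, not results from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# made-up actual and predicted values, for illustration only
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 10.0])

mse = mean_squared_error(actual, predicted)   # average squared error
rmse = np.sqrt(mse)                           # same units as the target
mae = mean_absolute_error(actual, predicted)  # average absolute error

print(mse, rmse, mae)
```

RMSE and MAE are in the same units as the thing you’re predicting (dollars, for house prices), which tends to make them easier to reason about than R-Squared alone.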

Look for more posts on using random forests for forecasting.


If you want a very good deep-dive into using Random Forest and other statistical methods for prediction, take a look at The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Amazon Affiliate link)

Book Review – Machine Learning With Random Forests And Decision Trees by Scott Hartshorn

I just finished reading Machine Learning With Random Forests And Decision Trees: A Mostly Intuitive Guide, But Also Some Python (amazon affiliate link).

The short review

This is a great introductory book for anyone looking to learn more about Random Forests and Decision Trees. You won’t be an expert after reading this book, but you’ll understand the basic theory and how to implement random forests in python.

The long(ish) review

This is a short book – only 76 pages. But…those 76 pages are full of good, introductory information on Random Forests and Decision Trees.  Even though I’ve been using random forests and other machine learning approaches in python for years, I can easily see value for people that are just starting out with machine learning and/or random forests. That said, there were a few things in the book that I had either forgotten or didn’t know (Entropy Criteria for example).

While the entire book is excellent, the section on Feature Importance is the best in the book.  This section provides a very good description of the ‘why’ and the ‘how’ of feature importance (and therefore, feature selection) for use in random forests and decision trees.  There are some very good points made in this section regarding how to get started with feature selection and cross validation.
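For a feel of what feature importance looks like in practice, here’s a rough sketch using scikit-learn’s impurity-based feature_importances_ attribute. This is my own illustration on synthetic data, not code from the book, and the feature names are made up: one column drives the target and the other is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data (an assumption for illustration): only the first
# column has anything to do with the target
rng = np.random.RandomState(0)
informative = rng.uniform(0, 10, size=300)
noise = rng.uniform(0, 10, size=300)
X = np.column_stack([informative, noise])
y = 2.0 * informative + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# one importance value per column; the values sum to 1
importances = model.feature_importances_
print(dict(zip(['informative', 'noise'], importances)))
```

The informative column should get nearly all of the importance, which is the kind of signal you’d use to decide which features to keep.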

Additionally, the book provides a decent overview of the idea of ‘out-of-sample’ (or ‘Out-of-bag’) data.  I’m a huge believer in keeping some data out of your initial training data set to use for validation after you’ve built your models.
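Here’s a small sketch of both ideas side by side in scikit-learn, again on synthetic data of my own making rather than anything from the book. The out-of-bag score is computed from the training rows each tree happened not to see in its bootstrap sample, while the hold-out score uses rows that were never shown to the forest at all.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

# keep the last 100 rows out of training entirely
X_train, y_train = X[:200], y[:200]
X_holdout, y_holdout = X[200:], y[200:]

model = RandomForestRegressor(n_estimators=100, oob_score=True,
                              random_state=0)
model.fit(X_train, y_train)

# R-Squared estimated from out-of-bag training rows
oob_r2 = model.oob_score_
# R-Squared on rows the forest never saw
holdout_r2 = model.score(X_holdout, y_holdout)

print(oob_r2, holdout_r2)
```

The two numbers are usually close, which is why the out-of-bag score is often used as a cheap stand-in for a separate validation set.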

If you’re looking for a good introductory book on random forests and decision trees, pick this one up (amazon affiliate link)…it’s only $2.99 for the kindle version.  Like I mentioned earlier, this book won’t make you an expert but it will provide a solid grounding to get started on the topic of random forests, decision trees and machine learning.

One negative comment I have on this book is that there is very little python in the book. The book isn’t marketed as strictly a python book, but I would have expected a bit more python in the book to help drive home some of the theory with runnable code. That said, this is a very small negative to the book overall.