
Comparing Machine Learning Methods

When working with data and modeling, it's sometimes hard to determine which model you should use for a particular project. A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop and see how various models perform against your data. In this post, I'll be comparing machine learning methods using a few different sklearn algorithms. As always, you can find a jupyter notebook for this article on my github here and find other articles on this topic here.

I’ve used Jason Brownlee’s article from 2016 as the basis for this article…I wanted to expand a bit on what he did as well as use a different dataset. In this article, we’ll be using the Indian Liver Disease dataset (found here).

From the dataset page:

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The “Dataset” column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.

Let’s get started by setting up our imports that we’ll use.

import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Next, we’ll read in the data from the CSV file located in the local directory.

#read in the data
data = pd.read_csv('indian_liver_patient.csv')

If you call head() on the dataframe, you'll get a good feel for the dataset.
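That quick look is just one line:

#peek at the first few rows of the dataframe
data.head()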

Indian Liver Disease Data

We'll use all columns except Gender for this tutorial. We could use Gender by converting it to a numeric value (e.g., 0 for Male, 1 for Female), but for the purposes of this post we'll just skip this column.
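If you did want to keep Gender, a quick sketch of that encoding might look like the following (this isn't part of the workflow below, and it assumes the column holds the strings 'Male' and 'Female'):

#optional: encode Gender as a number instead of dropping it
#(0 for Male, 1 for Female is an arbitrary choice for illustration)
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})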

#work on a copy so the original dataframe stays untouched
data_to_use = data.copy()
#drop the Gender column and any rows with missing values
del data_to_use['Gender']
data_to_use.dropna(inplace=True)

The 'Dataset' column is the value we are trying to predict…whether the patient has liver disease or not…so we'll use that as our "Y" and the other columns as our "X" array.

values = data_to_use.values
Y = values[:,9]   #the 'Dataset' label column
X = values[:,0:9] #the remaining feature columns

Before we run our machine learning models, we need to set a random seed to use with them. This can be any number you'd like. Some people like to use a random number generator, but for the purposes of this post I'll just set it to 12 (it could just as easily be 1 or 3 or 1023 or any other number).

random_seed = 12

Now we need to set up our models that we’ll be testing out. We’ll set up a list of the models and give them each a name. Additionally, I’m going to set up the blank arrays/lists for the outcomes and the names of the models to use for comparison.

outcome = []
model_names = []
models = [('LogReg', LogisticRegression()), 
          ('SVM', SVC()), 
          ('DecTree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LinDisc', LinearDiscriminantAnalysis()),
          ('GaussianNB', GaussianNB())]

We are going to use k-fold cross validation to evaluate each algorithm. We'll run through each model with a for loop, running the analysis and then storing the outcomes in the lists we created above. We'll use a 10-fold cross validation.

for model_name, model in models:
    #shuffle=True is needed for random_state to have an effect in newer versions of sklearn
    k_fold_validation = model_selection.KFold(n_splits=10, shuffle=True, random_state=random_seed)
    results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)
    model_names.append(model_name)
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)

The output from this loop is:

LogReg| Mean=0.718633 STD=0.058744
SVM| Mean=0.715124 STD=0.058962
DecTree| Mean=0.637568 STD=0.108805
KNN| Mean=0.651301 STD=0.079872
LinDisc| Mean=0.716878 STD=0.050734
GaussianNB| Mean=0.554719 STD=0.081961

From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discriminant Analysis methods are providing the best results (based on the 'mean' values). Taking Jason's lead, we can look at a box plot of the accuracy across the cross validation folds to see just how well each model does relative to the others and to their means.

fig = plt.figure()
fig.suptitle('Machine Learning Model Comparison')
ax = fig.add_subplot(111)
plt.boxplot(outcome)
ax.set_xticklabels(model_names)
plt.show()

Machine Learning Comparison

From the box plot, it is easy to see that the three machine learning methods mentioned above (Logistic Regression, Support Vector Machine and Linear Discriminant Analysis) are providing the better accuracies. From this outcome, we can take these three models forward and start working on optimizing the modeling process to see if one works a bit better than the others.
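As a rough sketch of what that next step might look like (this isn't part of the comparison above, and the parameter grid is just an illustration), you could grid search a hyperparameter for one of the stronger models, e.g. logistic regression:

from sklearn.model_selection import GridSearchCV

#try a handful of regularization strengths for logistic regression
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, Y)
print("Best params: %s, best mean accuracy: %f" % (grid.best_params_, grid.best_score_))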

Book Review – Machine Learning With Random Forests And Decision Trees by Scott Hartshorn

I just finished reading Machine Learning With Random Forests And Decision Trees: A Mostly Intuitive Guide, But Also Some Python (amazon affiliate link).

The short review

This is a great introductory book for anyone looking to learn more about Random Forests and Decision Trees. You won't be an expert after reading this book, but you'll understand the basic theory and how to implement random forests in python.

The long(ish) review

This is a short book – only 76 pages. But…those 76 pages are full of good, introductory information on Random Forests and Decision Trees.  Even though I’ve been using random forests and other machine learning approaches in python for years, I can easily see value for people that are just starting out with machine learning and/or random forests. That said, there were a few things in the book that I had either forgotten or didn’t know (Entropy Criteria for example).

While the entire book is excellent, the section on Feature Importance is the best in the book.  This section provides a very good description of the ‘why’ and the ‘how’ of feature importance (and therefore, feature selection) for use in random forests and decision trees.  There are some very good points made in this section regarding how to get started with feature selection and cross validation.
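For reference (this is my own quick sketch, not code from the book), sklearn exposes those importances directly once a random forest is fit; here's what that looks like on a toy dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

#fit a forest on the iris data and see which features it leaned on most
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))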

Additionally, the book provides a decent overview of the idea of ‘out-of-sample’ (or ‘Out-of-bag’) data.  I’m a huge believer in keeping some data out of your initial training data set to use for validation after you’ve built your models.
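In sklearn terms (again, my own sketch rather than the book's code), out-of-bag scoring is a one-flag change on the same toy data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

#oob_score=True has the forest score itself on the rows each tree never saw
#during bootstrapping, giving a built-in hold-out style estimate
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(iris.data, iris.target)
print(forest.oob_score_)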

If you're looking for a good introductory book on random forests and decision trees, pick this one up (amazon affiliate link)…it's only $2.99 for the kindle version. Like I mentioned earlier, this book won't make you an expert but it will provide a solid grounding to get started on the topic of random forests, decision trees and machine learning.

One negative comment I have is that there is very little python in the book. It isn't marketed as strictly a python book, but I would have expected a bit more python to help drive home some of the theory with runnable code. That said, this is a very small negative for the book overall.


Stationary Data Tests for Time Series Forecasting

I wasn't planning on making a 'part 2' to the Forecasting Time Series Data using Autoregression post from last week, but I really wanted to show how to use more advanced tests to check for stationary data. Additionally, I wanted to use a new dataset that I ran across on Kaggle for energy consumption at an hourly level (find the dataset here). For this example, I'm going to be using the `DEOK_hourly` dataset (I've added it to my git repo here). You can follow along with the jupyter notebook here.

In this post, I’m going to follow the same approach that I took in the previous one – using autoregression to forecast time series data after checking to ensure the data is stationary.

Checking for Stationary data

So, what do we need to do to check for stationary data?  We can do the following:

  • Plot the data – this is the first step and will often provide a great deal of information about your data. Regardless of the data you're using or the steps you take afterwards, this should always be the first step in your process.
  • Statistical summaries and tests – there are a plethora of statistical tests that you can / should run, but a quick summary of your data is probably the best place to start. Additionally, you can run tests like the Dickey-Fuller test to help understand your data and its stationarity (there's a quick sketch of the summary idea right after this list).
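That 'quick summary' might look something like the sketch below (the 24-hour rolling window is just my choice for hourly data, and this assumes the dataset has already been downloaded to the local directory as we do a bit further down):

import pandas as pd

#quick numeric summary plus rolling statistics as a first stationarity check
data = pd.read_csv('DEOK_hourly.csv')
print(data['DEOK_MW'].describe())
rolling = data['DEOK_MW'].rolling(window=24)
print(rolling.mean().tail())
print(rolling.std().tail())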

Let’s plot our data first and take a look at a couple different plots. First, let’s get our imports taken care of.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
 
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Next, let’s load our data and plot the time series.

data = pd.read_csv('DEOK_hourly.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'])
data.set_index('Datetime', inplace=True)
#plot the hourly energy consumption time series
data['DEOK_MW'].plot()

DEOK Time Series plot

Looking at the plot, the data looks pretty stationary. There's no real trend in the time series, but there seems to be something that might be seasonality, so we'll dig deeper into the data. Let's plot a histogram to see what the underlying distribution looks like.

data['DEOK_MW'].hist()

DEOK Histogram

Looks Gaussian with a bit of a long tail skew toward the right. From this histogram, I’m pretty confident that we have a stationary dataset otherwise we’d see something much less ‘bell-shaped’ due to trending and/or seasonality (e.g., we’d see more data plotted to the left or right).

Now, let’s look at some statistical tests. A simple one that you can use is to look at the mean and variance of multiple sections of the data and compare them. If they are similar, your data is most likely stationary.

There are many different ways to split the data for this check, but one way I like to do this is to follow the approach highlighted here.

#shuffle the series, then split it into three chunks (roughly 25% / 50% / 25%)
one, two, three = np.split(
    data['DEOK_MW'].sample(frac=1),
    [int(.25 * len(data['DEOK_MW'])), int(.75 * len(data['DEOK_MW']))])

The above code shuffles the data and creates three new series: roughly 25% of the rows go to series one and the remaining 75% is split between series two and three – but you could create them of equal length if you wanted. I like making them different sizes just to add a bit of extra randomness to the test.

Next, we’ll look at the means and variances of each series to see what they look like. Remember, if the data is stationary, the means/variances should be similar.

mean1, mean2, mean3 = one.mean(), two.mean(), three.mean()
var1, var2, var3 = one.var(), two.var(), three.var()
print(mean1, mean2, mean3)
print(var1, var2, var3)

The output of this is:

3093.27497575 3107.45445099 3112.20124697
353154.655416 363558.421407 358899.692558

Not great formatting, but you can quickly see that the means and variances are similar, pointing to stationary data.
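If you want cleaner output, a formatted print is easy enough (purely cosmetic, not something the original code does):

#same numbers, just a little easier to scan
print('means:     %.2f  %.2f  %.2f' % (mean1, mean2, mean3))
print('variances: %.2f  %.2f  %.2f' % (var1, var2, var3))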

Now that you know how to find stationarity using some plots and some basic stats, you should know that the above tests can be fooled sometimes, especially since they make assumptions about your data. So…don’t rely on these only…they’re a quick way to see what you have without having to pull out the big guns and run things like the Dickey-Fuller test.

Dickey-Fuller Test for Stationarity

Officially, this is called the 'augmented Dickey-Fuller test', but most folks just say 'Dickey-Fuller' when talking about it. This test checks the null hypothesis that a unit root is present in the time series data. To make things a bit more clear, it is checking whether the data is stationary or non-stationary. The test tries to reject the null hypothesis that a unit root exists and the data is non-stationary. If the null hypothesis is rejected, then the alternate can be considered valid (i.e., the data is stationary). You can read more about the test here if interested.

When you run the test, you'll get an ADF statistic and a p-value. The ADF statistic should be a negative number, and the p-value should be beneath a chosen threshold (e.g., 1% or 5%) for the corresponding confidence level. For this example, we'll use 5% (a 95% confidence level): if the p-value is greater than 0.05, we fail to reject the null hypothesis, meaning we can't rule out a unit root and the data may be non-stationary. If the p-value is less than or equal to 0.05, we reject the null hypothesis and conclude the data does not have a unit root and is stationary.
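Spelled out as code, that decision rule is just a comparison. The helper below is hypothetical (not something from statsmodels), only here to make the logic concrete:

#translate an ADF p-value into a stationarity call at a given significance level
def interpret_adf(p_value, alpha=0.05):
    if p_value <= alpha:
        return 'reject the null: no unit root, data looks stationary'
    return 'fail to reject the null: possible unit root, data may be non-stationary'

print(interpret_adf(0.2))    #example of a non-stationary call
print(interpret_adf(0.001))  #example of a stationary call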

Let’s run the Augmented Dickey-Fuller test and see what we see.  The statsmodels library has a function called adfuller to make it easy for us to run this test.

from statsmodels.tsa.stattools import adfuller
adf_test = adfuller(data['DEOK_MW'])
print("ADF = " + str(adf_test[0]))
print("p-value = " + str(adf_test[1]))

In this code, we import the adfuller function from the statsmodels library and then run our data through the test. The full output of the test is:

(-14.913267801069782,
 1.4477674072055658e-27,
 57,
 57681,
 {'1%': -3.4304633751328555,
  '10%': -2.5667966716717614,
  '5%': -2.8615901096273602},
 669611.23911962728)

The ADF value is the first value in the result and the p-value is the 2nd.  The ‘1%’, ‘10%’ and ‘5%’ values are the critical values for 99%, 90% and 95% confidence levels.
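Since the result is just a tuple (with the critical values stored in a dict), you can pull those pieces out directly; assuming adf_test holds the result above, something like this works:

#unpack the parts of the adfuller result we care about
adf_statistic, p_value = adf_test[0], adf_test[1]
critical_values = adf_test[4]
for level, value in critical_values.items():
    print('Critical value (%s): %.4f' % (level, value))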

Let’s look specifically at our ADF and p-values.

print "ADF = " + str(adf_test[0])
print "p-value = " +str(adf_test[1])

We get these results:

ADF = -14.9132678011
p-value = 1.44776740721e-27

Our p-value is definitely less than 0.05 and is even less than 0.01, so we can say with pretty good confidence that we can reject the null (unit root, non-stationary data) and can assume our data is stationary. Additionally, our ADF statistic is much less than the 1% critical value of -3.43, so we have another confirmation that we can reject the null.

Now that we know it's stationary, we need to see if it's correlated (remember, there's an assumption of dependence / correlation for autoregression). Let's look at a lag plot.

#lag_plot lives under pd.plotting in current versions of pandas
pd.plotting.lag_plot(data['DEOK_MW'])

DEOK Lag Plot

No question…that data is correlated somehow.

Now…we can actually DO something with the data! Let’s run a forecast on it now using autoregression.

Forecasting Time Series Data using Autoregression

We know our data is stationary and correlated (or at least we *believe* it is based on our tests). Let’s run our autoregression forecast and see what we see.

For this, we'll take a slightly different approach than we did before since we have much more data. We'll use the same training/testing data creation that we used in the previous post and create a 12-period testing dataset and prediction dataset (i.e., we are going to predict the 'next' 12 readings).

#create train/test datasets
X = data['DEOK_MW'].dropna()
train_data = X[1:len(X)-12]
test_data = X[len(X)-12:]

Now, we'll run the AR() model.

from statsmodels.tsa.ar_model import AR
from sklearn.metrics import r2_score
#train the autoregression model
model = AR(train_data)
model_fitted = model.fit()
print('The lag value chosen is: %s' % model_fitted.k_ar)

The lag value chosen for this model is 59.  Now, let’s make some predictions and check the accuracy.

# make predictions 
predictions = model_fitted.predict(
    start=len(train_data), 
    end=len(train_data) + len(test_data)-1, 
    dynamic=False)

# create a comparison dataframe
compare_df = pd.concat(
    [data['DEOK_MW'].reset_index().tail(12),
    predictions], axis=1).rename(
    columns={'DEOK_MW': 'actual', 0:'predicted'})
compare_df=compare_df[['actual', 'predicted']].dropna()

In the above, we are making predictions and then creating a dataframe to compare the ‘predicted’ values versus the ‘actual’ values. Plotting these values together gives us the following.
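The plotting call isn't shown above; a minimal version might be:

#plot actual vs predicted for the 12 forecast periods
compare_df.plot()
plt.show()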

DEOK Actual vs Predicted

Not a bad forecast – the cycle is captured pretty well, but the magnitude is a bit off. Let's take a look at r-squared.

r2 = r2_score(compare_df.actual, compare_df.predicted)

Our r-squared is 0.76, which is pretty good for a first pass at this data and forecasting, especially given the fact that our lag is auto-selected for us.

Hopefully this helps shed some light on how to use statistical tests and plots to check for stationarity when running forecasts with time series data.


Contact me / Hire me

If you're working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science…if this is of interest, contact me.

