Category: time series Stationary Data Tests for Time Series Forecasting

I wasn’t planning on making a ‘part 2’ to the Forecasting Time Series Data using Autoregression post from last week, but I really wanted to show how to use more advanced tests to check for stationary data. Additionally, I wanted to use a new dataset that I ran across on Kaggle for energy consumption at an hourly level (find the dataset here).  For this example, I’m going to be using the `DEOK_hourly` dataset (i’ve added it to my git repo here).  You can follow along with the jupyter notebook here.

In this post, I’m going to follow the same approach that I took in the previous one – using autoregression to forecast time series data after checking to ensure the data is stationary.

Checking for Stationary data

So, what do we need to do to check for stationary data?  We can do the following:

• Plot the data – this is the first step and often will provide a great deal of information about your data. Regardless of the data you’re using or the steps you take afterwards, this should always be the first step in your process.
• Statistics Summaries and Tests  – There are a plethora of statistical tests that you can / should run but a quick summary of your data is probably the best thing to do.  Additionally, you can run tests like the Dickey-Fuller test to help understand your data and its stationarity.

Let’s plot our data first and take a look at a couple different plots. First, let’s get our imports taken care of.

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline

plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Next, let’s load our data and plot the time series.

data['Datetime']=pd.to_datetime(data['Datetime'])
data.set_index('Datetime', inplace=True) Looking at the data, it looks pretty stationary. There’s no real trend in the time series but there seems to be something that might be seasonality, so we’ll dig deeper into the data.  Let’s plot a histogram to see what the underlying distribution looks like.

data['DEOK_MW'].hist() Looks Gaussian with a bit of a long tail skew toward the right. From this histogram, I’m pretty confident that we have a stationary dataset otherwise we’d see something much less ‘bell-shaped’ due to trending and/or seasonality (e.g., we’d see more data plotted to the left or right).

Now, let’s look at some statistical tests. A simple one that you can use is to look at the mean and variance of multiple sections of the data and compare them. If they are similar, your data is most likely stationary.

There are many different ways to split the data for this check, but one way I like to do this is to follow the approach highlighted here.

one, two, three = np.split(
data['DEOK_MW'].sample(
frac=1), [int(.25*len(data['DEOK_MW'])),
int(.75*len(data['DEOK_MW']))])

The above code creates three new series. I randomly selected 25% for series one and 75% for the two and three – but you could create them of equal length if you wanted. I like making them different sizes just for a bit of extra randomness to the test.

Next, we’ll look at the means and variances of each series to see what they look like. Remember, if the data is stationary, the means/variances should be similar.

mean1, mean2, mean3 = one.mean(), two.mean(), three.mean()
var1, var2, var3 = one.var(), two.var(), three.var()

print mean1, mean2, mean3
print var1, var2, var3

The output of this is:

3093.27497575 3107.45445099 3112.20124697
353154.655416 363558.421407 358899.692558

Not great formatting, but you can quickly see that the means and variances are similar, pointing to stationary data.

Now that you know how to find stationarity using some plots and some basic stats, you should know that the above tests can be fooled sometimes, especially since they make assumptions about your data. So…don’t rely on these only…they’re a quick way to see what you have without having to pull out the big guns and run things like the Dickey-Fuller test.

Dickey-Fuller Test for Stationarity

Officially, this is called the ‘augmented Dickey-Fuller test’, but most folks just say ‘Dickey-Fuller’ when talking about it.  This is a test that tests the null hypothesis that a unit root is present in time series data.    To make things a bit more clear, this test is checking for stationarity or non-stationary data.  The test is trying to reject the null hypothesis that a unit root exists and the data is non-stationary. If the null hypothesis is rejected, then the alternate can be considered valid (e.g., the data is stationary).  You can read more about the test here if interested.

When you run the test, you’ll get an ADF value and a p-value. The ADF number should be a negative number and the p-value should be beneath a certain threshold value (e.g., 1% or 5%, etc) for a confidence level. For this example, we’ll use 5% (or 95% confidence level), so if the p-value is greater than 0.05 then we say we fail to reject the null hypothesis because the data has a unit root and is non-stationary.  If the p-value is less than or equal to 0.05, we can say we reject the null hypothesis because the data does not have a unit root and is stationary.

Let’s run the Augmented Dickey-Fuller test and see what we see.  The statsmodels library has a function called adfuller to make it easy for us to run this test.

In this code, we import the adfuller library from the statsmodels library and then run our data through the test.  The full output of the test is:

(-14.913267801069782,
1.4477674072055658e-27,
57,
57681,
{'1%': -3.4304633751328555,
'10%': -2.5667966716717614,
'5%': -2.8615901096273602},
669611.23911962728)

The ADF value is the first value in the result and the p-value is the 2nd.  The ‘1%’, ‘10%’ and ‘5%’ values are the critical values for 99%, 90% and 95% confidence levels.

Let’s look specifically at our ADF and p-values.

We get these results:

p-value = 1.44776740721e-27

Our p-value is definitely less than 0.5 and is even less than 0.01 so we can say with pretty good confidence that we can reject the null (unit root, non-stationary data) and can assume our data is stationary. Additionally, our ADF is much less than our 1% confidence value of -3.43, so we have another confirmation that we can reject the null.

Now that we know its stationary, we need to see if its correlated (remember there’s an assumption of dependance / correlation for autoregression). Let’s look at a lagplot.

pd.tools.plotting.lag_plot(data['DEOK_MW']) No question…that data is correlated somehow.

Now…we can actually DO something with the data! Let’s run a forecast on it now using autoregression.

Forecasting Time Series Data using Autoregression

We know our data is stationary and correlated (or at least we *believe* it is based on our tests). Let’s run our autoregression forecast and see what we see.

For this, we’ll use a different approach than we did before sine we have much more data. We’ll use the same training/testing data creation that we used in the previous post and create a 12 period testing dataset and prediction dataset (i.e., we are going to predict the ‘next’ 12 readings).

#create train/test datasets
X = data['DEOK_MW'].dropna()

train_data = X[1:len(X)-12]
test_data = X[len(X)-12:]

Now, we’ll run our the AR() model.

from statsmodels.tsa.ar_model import AR
from sklearn.metrics import r2_score

#train the autoregression model
model = AR(train_data)
model_fitted = model.fit()

print('The lag value chose is: %s' % model_fitted.k_ar)

The lag value chosen for this model is 59.  Now, let’s make some predictions and check the accuracy.

# make predictions
predictions = model_fitted.predict(
start=len(train_data),
end=len(train_data) + len(test_data)-1,
dynamic=False)

# create a comparison dataframe
compare_df = pd.concat(
[data['DEOK_MW'].reset_index().tail(12),
predictions], axis=1).rename(
columns={'DEOK_MW': 'actual', 0:'predicted'})
compare_df=compare_df[['actual', 'predicted']].dropna()

In the above, we are making predictions and then creating a dataframe to compare the ‘predicted’ values versus the ‘actual’ values. Plotting these values together gives us the following. Not a bad forecast with the cycle being pretty good but magnitude being a bit off. Let’s take a look at r-squared.

r2 = r2_score(compare_df.actual, compare_df.predicted)

Our r-squared is 0.76, which is pretty good for a first pass at this data and forecasting, especially given the fact that our lag is auto-selected for us.

Hopefully this helps shed some light on how to use statistical tests and plots to check for stationarity when running forecasts with time series data.

Contact me / Hire me

If you’re working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science….if this is of interested, contact me. Forecasting Time Series Data using Autoregression

This is (yet) another post on forecasting time series data (you can find all the forecasting posts here).  In this post, we are going to talk about Autoregression models and how you might be able to apply them to forecasting time series problems.

Before we get into the forecasting time series , let’s talk a bit about autoregression models as well as some of the steps you need to take before you dive into using them when using them in forecasting time series data. You can jump over to view my jupyter notebook (simplified without comments) here.

Autoregression vs Linear Regression

Autoregression modeling is a modeling technique used for time series data that assumes linear continuation of the series so that previous values in the time series can be used to predict futures values.  Some of you may be thinking that this sounds just like a linear regression – it sure does sound that way and is – in general – the same idea with additional features of the model that includes the idea of ‘lag variables’.

With a linear regression model, you’re taking all of the previous data points to build a model to predict a future data point using a simple linear model. The simple linear regression model is explained in much more detail here. An example of a linear model can be found below:

y = a + b*X

where a and b are variables found during the optimization/training process of the linear model.

With the autoregression model, your’e using previous data points and using them to predict future data point(s) but with multiple lag variables. Autocorrelation and autoregression are discussed in more detail here. An example of an autoregression model can be found below:

y = a + b1*X(t-1) + b2*X(t-2) + b3*X(t-3)

where a, b1, b2 and b3 are variables found during the training of the model and X(t-1), X(t-2) and X(t-3) are input variables at previous times within the data set.

The above is not nearly enough statistical background to truly understand linear and autoregression models, but I hope it gets you some basic understanding of how the two approaches differ.  Now, let’s dig into how to implement this with python.

Forecasting Time Series with Autoregression

For this type of modeling, you need to be aware of the assumptions that are made prior to beginning working with data and autoregression modeling.

Assumptions:

• The previous time step(s) is useful in predicting the value at the next time step (dependance between values)
• Your data is stationary. A time series is stationary if is mean (and/or variance) is constant over time. There are other statistical properties to look at as well, but looking at the mean is usually the fastest/easiest.

If your time series data isn’t stationary, you’ll need to make it that way with some form of trend and seasonality removal (we’ll talk about that shortly).   If your time series data values are independent of each other, autoregression isn’t going to be a good forecasting method for that series.

Lets get into some code and some actual ‘doing’ rather than ‘talking’.

For this example, I’m going to use the retail sales data that I’ve used in the past.  Let’s load the data and take a look at the plot.

### Initial imports to get started.

import pandas as pd
import matplotlib.pylab as plt

%matplotlib inline

plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

sales_data['date']=pd.to_datetime(sales_data['date'])
sales_data.set_index('date', inplace=True)

sales_data.plot()

Nothing fancy here…just simple pandas loading and plotting (after the standard imports for this type of thing).

The plot looks like the following: Let’s check for dependance (aka, correlation) – which is the first assumption for autoregression models. A visual method for checking correlation is to use pandas lag_plot() function to see how well the values of the original sales data are correlated with each other. If they are highly correlated, we’ll see a fairly close grouping of datapoints that align along some point/line on the plot.

pd.tools.plotting.lag_plot(sales_data['sales']) Because we don’t have many data points, this particular lag_plot() doesn’t look terribly convincing, but there is some correlation in there (along with some possible outliers).

A great example of correlated values can be seen in the below lag_plot() chart. These are taken from another project I’m working on (and might write up in another post). Like good data scientists/statisticians, we don’t want to just rely on a visual representation of correlation though, so we’ll use the idea of autocorrelation plots to look at correlations of our data.

Using pandas, you can plot an autocorrelation plot using this command:

pd.tools.plotting.autocorrelation_plot(sales_data['sales'])

The resulting chart contains a few lines on it separate from the autocorrelation function. The dark horizontal line at zero just denotes the zero line, the lighter full horizontal lines is the 95% confidence level and the dashed horizontal lines are 99% confidence levels, which means that correlations are more significant if they occur at those levels. From the plot above, we can see there’s some significant correlation between t=1 and t=12 (roughly) with significant decline in correlation after that timeframe.  Since we are looking at monthly sales data, this seems to make sense with correlations falling off at the start of the new fiscal year.

We can test this concept by checking the pearson correlation of the sales data with lagged values using the approach below.

sales_data['sales'].corr(sales_data['sales'].shift(12))

We used ’12’ above because that looked to be the highest correlation value from the autocorrelation chart. The output of the above command gives us a correlation value of 0.97 which is quite high (and actually almost too high for my liking, but it is what it is).

Now, let’s take a look at stationarity.  I can tell you just from looking at that chart that we have a non-stationary dataset due to the increasing trend from lower left to upper right as well as some seasonality (you can see large spikes at roughly the same time within each year).  There are plenty of tests that you can do to determine if seasonality / trend exist a time series, but for the purpose of this example, I’m going to do a quick/dirty plot to see trend/seasonality using the seasonal_decompose() method found in the statsmodels library.

from statsmodels.tsa.seasonal import seasonal_decompose

Note: In the above code, we are assigning decomposed.plot() to x. If you don’t do this assignment, the plot is shown in the jupyter notebook. If anyone knows why this is the case, let me know. Until I figure out why, I’ve just been doing it this way.

The resulting plot is below. Now we know for certain that we have a time series that has a trend (2nd panel from top) and has seasonality (third panel from top).  Now what?  Let’s make it stationary by removing/reducing trend and seasonality.

For the purposes of this particular example, I’m just going to use the quick/dirty method of differencing to get a more stationary model.

sales_data['stationary']=sales_data['sales'].diff()

Plotting this new set of data gets us the following plot. Running seasonal_decompose() on this new data gives us: From this new decomposed plot, we can see that there’s still some trend and even some seasonality, which is unfortunate because it means we’d need to take a look at other methods to truly remove trend and seasonality from this particular data series, but for this example, I’m going to play dumb and say that its good enough and keep going (and in reality, it might be good enough — or it might not be good enough).

Forecasting Time Series Data – Now on to the fun stuff!

Alright – now that we know our data fits our assumptions, at least well enough for this example. For this, we’ll use the AR() model in statsmodels library. I’m using this particular model becasue it auto-selects the lag value for modeling, which can simplify things. Note: this may not be the ideal approach, but is a good approach when first starting this type of work.

from statsmodels.tsa.ar_model import AR

#create train/test datasets
X = sales_data['stationary'].dropna()
train_data = X[1:len(X)-12]
test_data = X[X[len(X)-12:]]

#train the autoregression model
model = AR(train_data)
model_fitted = model.fit()

In the above, we are simply creating a testing and training dataset and then creating and fitting our AR() model. Once you’ve fit the model, you can look at the chosen lag and parameters of the model using some simple print statements.

print('The lag value chose is: %s' % model_fitted.k_ar)

The lag value chose is: 10

print('The coefficients of the model are:\n %s' % model_fitted.params)

The coefficients of the model are:
const             7720.952626
L1.stationary       -1.297636
L2.stationary       -1.574980
L3.stationary       -1.403045
L4.stationary       -1.123204
L5.stationary       -0.472200
L6.stationary       -0.014586
L7.stationary        0.564099
L8.stationary        0.792080
L9.stationary        0.843242
L10.stationary       0.395546

If we look back at our autocorrelation plot, we can see that the lag value of 10 is where the line first touches the 95% confidence level, which is usually the way you’d select the lag value when you first run autoregression models if you were selecting things manually, so the selection makes sense.

Now, let’s make some forecasts and see how they compare to actuals.

# make predictions
predictions = model_fitted.predict(
start=len(train_data),
end=len(train_data) + len(test_data)-1,
dynamic=False)

# create a comparison dataframe
compare_df = pd.concat(
[sales_data['stationary'].tail(12),
predictions], axis=1).rename(
columns={'stationary': 'actual', 0:'predicted'})

#plot the two values
compare_df.plot()

In this bit of code, we’ve made predictions and then combined the prediction values with the ‘test’ data from the sales_data dataframe. That’s really not a bad model at it shows trend and movements (high/lows, etc) well but doesn’t quite get the extreme values.   Let’s check our root mean square error.

from sklearn.metrics import r2_score

r2 = r2_score(sales_data['stationary'].tail(12), predictions)

This gives us a root mean square value of 0.64, which isn’t terrible but there is room for improvement here.

One thing to note about statsmodels AR() libary is that it makes it difficult to use this in on ‘online’ fashion (e.g., train a model and then add new data points as they come in). You’d need to either retrain your model based on the new datapoint added or just save the coefficients from the model and predict your own values as needed.

I hope this has been a good introduction of forecasting time series data using autoregression in python. A always, if you have any questions or comments, leave them in the comment section or contact me.

Note: If you have some interest in learning more about determining stationarity and other methods for eliminating trend and seasonality beyond just differencing, let me know and i’ll put another post up that talks about those things in detail.

Contact me / Hire me

If you’re working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science….if this is of interested, contact me.

Forecasting Time Series data with Prophet – Part 4

This is the fourth in a series of posts about using Forecasting Time Series data with Prophet. The other parts can be found here:

In those previous posts, I looked at forecasting monthly sales data 24 months into the future using some example sales data that you can find here.

In this post, I want to look at the output of Prophet to see how we can apply some metrics to measure ‘accuracy’.  When we start looking at ‘accuracy’ of forecasts, we can really do a whole lot of harm by using the wrong metrics and the wrong data to measure accuracy.  That said, its good practice to always try to compare your predicted values with your actual values to see how well or poorly your model(s) are performing.

For the purposes of this post, I’m going to expand on the data in the previous posts. For this post we are using fbprophet version 0.2.1.  Also – we’ll need scikit-learn and scipy installed for looking at some metrics.

Note: While I’m using Prophet to generate the models, these metrics and tests for accuracy can be used with just about any modeling approach.

Since the majority of the work has been covered in Part 3, I’m going to skip down to the metrics section…you can see the entire code and follow along with the jupyter notebook here.

In the notebook, we’ve loaded the data. The visualization of the data looks like this: Our prophet model forecast looks like: Again…you can see all the steps in thejupyter notebook if you want to follow along step by step.

Now that we have a prophet forecast for this data, let’s combine the forecast with our original data so we can compare the two data sets.

metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()

The above line of code takes the actual forecast data ‘yhat’ in the forecast dataframe, sets the index to be ‘ds’ on both (to allow us to combine with the original data-set) and then joins these forecasts with the original data. lastly, we reset the indexes to get back to the non-date index that we’ve been working with (this isn’t necessary…just a step I took).

The new dataframe looks like this: You can see from the above, that the last part of the dataframe has “NaN” for ‘y’…that’s fine because we are only concerned about checking the forecast values versus the actual values so we can drop these “NaN” values.

metric_df.dropna(inplace=True)

Now, we have a dataframe with just the original data (in the ‘y’ column) and forecasted data (in the yhat column) to compare. Now, we are going to take a look at a few metrics.

Metrics for measuring modeling accuracy

If you ask 100 different statisticians, you’ll probably get at least 50 different answers on ‘the best’ metrics to use for measuring accuracy of models.  For most cases, using either R-Squared, Mean Squared Error and Mean Absolute Error (or a combo of them all) will get you a good enough measure of the accuracy of your model.

For me, I like to use R-Squared and Mean Absolute Error (MAE).  With these two measures, I feel like I can get a really good feel for how well (or poorly) my model is doing.

Python’s ScitKit Learn has some good / easy methods for calculating these values.  To use them, you’ll need to import them (and have scitkit-learn and scipy installed). If you don’t have scitkit-learn and scipy installed, you can do so with the following command:

pip install scikit-learn scipy

Now, you can import the metrics with the following command:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

To calculate R-Squared, we simply do the following:

r2_score(metric_df.y, metric_df.yhat)

For this data, we get an R-Squared value of 0.99.   Now…this is an amazing value…it can be interpreted to mean that 99% of the variance in this data is explained by the model. Pretty darn good (but also very very naive in thinking). When I see an R-Squared value like this, I immediately think that the model has been overfit.   If you want to dig into a good read on what R-Squared means and how to interpret it, check out this post.

Now, let’s take a look at MSE.

mean_squared_error(metric_df.y, metric_df.yhat)

The MSE turns out to be 11,129,529.44. That’s a huge value…an MSE of 11 million tells me this model isn’t all that great, which isn’t surprising given the low number of data points used to build the model.  That said, a high MSE isn’t a bad thing necessarily but it give you a good feel for the accuracy you can expect to see.

Lastly, let’s take a look at MAE.

mean_absolute_error(metric_df.y, metric_df.yhat)

For this model / data, the MAE turns out to be 2,601.15, which really isn’t all that bad. What that tells me is that for each data point, my average magnitude of error is roughly \$2,600, which isn’t all that bad when we are looking at sales values in the \$300K to \$500K range.  BTW – if you want to take a look at an interesting comparison of MAE and RMSE (Root Mean Squared Error), check out this post.

Hopefully this has been helpful.  It wasn’t the intention of this post to explain the intricacies of these metrics, but hopefully you’ve seen a bit about how to use metrics to measure your models. I may go into more detail on modeling / forecasting accuracies in the future at some point. Let me know if you have any questions on this stuff…I’d be happy to expand if needed.

Note: In the jupyter notebook,  I show the use of a new metrics library I found called ML-Metrics. Check it out…its another way to run some of the metrics.

If you want to learn more about time series forecating, here’s a few good books on the subject. These are Amazon links…I’d appreciate it if you used them if you purchase these books as the little bit of income that comes from these links helps pay for the server this blog runs on.