# Search Results for: machine learning

## Forecasting Time Series Data using Autoregression

This is (yet) another post on forecasting time series data (you can find all the forecasting posts here).  In this post, we are going to talk about Autoregression models and how you might be able to apply them to forecasting time series problems.

Before we get into forecasting, let’s talk a bit about autoregression models and the steps you need to take before using them to forecast time series data. You can jump over to view my jupyter notebook (simplified without comments) here.

### Autoregression vs Linear Regression

Autoregression modeling is a modeling technique used for time series data that assumes linear continuation of the series so that previous values in the time series can be used to predict future values. Some of you may be thinking that this sounds just like a linear regression. It does, and in general it is the same idea, with the addition of ‘lag variables’ as the inputs to the model.

With a linear regression model, you’re taking all of the previous data points to build a model to predict a future data point using a simple linear model. The simple linear regression model is explained in much more detail here. An example of a linear model can be found below:

`y = a + b*X`

where a and b are coefficients found during the optimization/training process of the linear model.

With the autoregression model, you’re using previous data points to predict future data point(s), but with multiple lag variables. Autocorrelation and autoregression are discussed in more detail here. An example of an autoregression model can be found below:

`y = a + b1*X(t-1) + b2*X(t-2) + b3*X(t-3)`

where a, b1, b2 and b3 are coefficients found during the training of the model and X(t-1), X(t-2) and X(t-3) are the values of the series at previous time steps.
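To make the lag-variable idea concrete, here is a minimal sketch that fits the three-lag equation above with ordinary least squares. The `fit_ar` helper and the synthetic series are my own illustration (they are not part of any library), and the series is generated without noise from known coefficients so the fit can be checked against them.

```python
import numpy as np

def fit_ar(series, n_lags):
    """Fit y(t) = a + b1*y(t-1) + ... + bn*y(t-n) by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Each row holds [1, y(t-1), y(t-2), ..., y(t-n)] for one target y(t).
    rows = [
        np.concatenate(([1.0], y[t - n_lags:t][::-1]))
        for t in range(n_lags, len(y))
    ]
    design = np.vstack(rows)
    targets = y[n_lags:]
    coeffs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coeffs  # [a, b1, b2, ..., bn]

# Hypothetical coefficients used to generate a noise-free series,
# so the fit should recover them almost exactly.
true_a, true_b = 2.0, [0.5, -0.3, 0.1]
synthetic = [1.0, 2.0, 3.0]
for _ in range(60):
    nxt = true_a + sum(b * synthetic[-k] for k, b in enumerate(true_b, start=1))
    synthetic.append(nxt)

ar_coeffs = fit_ar(synthetic, n_lags=3)
print(ar_coeffs)
```

Real data is noisy, so the recovered coefficients would only approximate the underlying ones, but the mechanics (build a table of lagged values, then run a linear regression on it) are the same.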

The above is not nearly enough statistical background to truly understand linear and autoregression models, but I hope it gets you some basic understanding of how the two approaches differ.  Now, let’s dig into how to implement this with python.

### Forecasting Time Series with Autoregression

For this type of modeling, you need to be aware of the assumptions it makes before you begin working with your data.

Assumptions:

• The previous time step(s) is useful in predicting the value at the next time step (dependence between values)
• Your data is stationary. A time series is stationary if its mean (and/or variance) is constant over time. There are other statistical properties to look at as well, but looking at the mean is usually the fastest/easiest.

If your time series data isn’t stationary, you’ll need to make it that way with some form of trend and seasonality removal (we’ll talk about that shortly).   If your time series data values are independent of each other, autoregression isn’t going to be a good forecasting method for that series.
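As a rough illustration of the ‘constant mean’ idea, the sketch below compares the mean of the first and second half of a series. This is only a crude heuristic of my own, not a formal statistical test, and both series are made-up values.

```python
from statistics import mean

def split_means(series):
    """Return the mean of the first and second half of a series."""
    half = len(series) // 2
    return mean(series[:half]), mean(series[half:])

# Hypothetical monthly values with a steady upward trend.
trending = [100 + 5 * t for t in range(24)]
# Hypothetical values that fluctuate around a fixed level.
flat = [100 + (3 if t % 2 else -3) for t in range(24)]

print(split_means(trending))  # the halves differ: the mean is drifting
print(split_means(flat))      # the halves agree: no drift in the mean
```

If the two halves disagree by a lot, as they do for the trending series, that is a hint the series needs some trend removal before autoregression is applied.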

Let’s get into some code and some actual ‘doing’ rather than ‘talking’.

For this example, I’m going to use the retail sales data that I’ve used in the past.  Let’s load the data and take a look at the plot.

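```
### Initial imports to get started.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = (20, 10)
plt.style.use('ggplot')

# The load step is missing from this listing. 'retail_sales.csv' is a
# placeholder path (an assumption), so point it at your own copy of the data.
sales_data = pd.read_csv('retail_sales.csv')
sales_data['date'] = pd.to_datetime(sales_data['date'])
sales_data.set_index('date', inplace=True)
sales_data.plot()
```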

Nothing fancy here…just simple pandas loading and plotting (after the standard imports for this type of thing).

The plot looks like the following:

Let’s check for dependence (aka correlation), which is the first assumption for autoregression models. A visual method for checking correlation is to use the pandas `lag_plot()` function to see how well the values of the original sales data are correlated with each other. If they are highly correlated, we’ll see a fairly close grouping of datapoints that align along some point/line on the plot.

`pd.plotting.lag_plot(sales_data['sales'])`

Because we don’t have many data points, this particular `lag_plot()` doesn’t look terribly convincing, but there is some correlation in there (along with some possible outliers).

A great example of correlated values can be seen in the below `lag_plot()` chart. These are taken from another project I’m working on (and might write up in another post).

Like good data scientists/statisticians, we don’t want to just rely on a visual representation of correlation though, so we’ll use the idea of autocorrelation plots to look at correlations of our data.

Using pandas, you can plot an autocorrelation plot using this command:

`pd.plotting.autocorrelation_plot(sales_data['sales'])`

The resulting chart contains a few lines in addition to the autocorrelation function itself. The dark horizontal line marks zero, the solid lighter horizontal lines mark the 95% confidence band and the dashed horizontal lines mark the 99% confidence band. Correlations that fall outside a band are statistically significant at that level. From the plot above, we can see there’s some significant correlation between t=1 and t=12 (roughly) with significant decline in correlation after that timeframe. Since we are looking at monthly sales data, this seems to make sense with correlations falling off at the start of the new fiscal year.

We can test this concept by checking the Pearson correlation of the sales data with lagged values using the approach below.

`sales_data['sales'].corr(sales_data['sales'].shift(12))`

We used ’12’ above because that looked to be the highest correlation value from the autocorrelation chart. The output of the above command gives us a correlation value of `0.97` which is quite high (and actually almost too high for my liking, but it is what it is).
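If you want to see what that one-liner is doing, here is a minimal pure-Python sketch of the same idea: pair each value with the value `lag` steps earlier, drop the points that have no partner, and compute the Pearson correlation of the pairs. The helper names and the toy series are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lag_correlation(series, lag):
    """Correlate series[t] with series[t - lag], dropping unmatched points."""
    return pearson(series[lag:], series[:-lag])

# Hypothetical series: a 12-step pattern that shifts up by 10 each cycle.
pattern = [0, 2, 5, 3, 1, 0, 4, 8, 6, 2, 1, 9]
toy_sales = [10 * (t // 12) + pattern[t % 12] for t in range(48)]

print(lag_correlation(toy_sales, 12))  # every value is its lag-12 partner plus 10
print(lag_correlation(toy_sales, 5))   # a lag that does not match the cycle
```

In the toy series, a lag equal to the cycle length lines every point up with its counterpart from the previous cycle, which is the same effect that makes the lag-12 correlation so high for monthly sales data.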

Now, let’s take a look at stationarity. I can tell you just from looking at that chart that we have a non-stationary dataset due to the increasing trend from lower left to upper right as well as some seasonality (you can see large spikes at roughly the same time within each year). There are plenty of tests that you can do to determine if seasonality / trend exist in a time series, but for the purpose of this example, I’m going to do a quick/dirty plot to see trend/seasonality using the `seasonal_decompose()` method found in the `statsmodels` library.

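```
from statsmodels.tsa.seasonal import seasonal_decompose

# The rest of this listing was cut off; these two lines are reconstructed
# from the note below and assume the default decomposition settings.
decomposed = seasonal_decompose(sales_data['sales'])
x = decomposed.plot()
```

Note: In the above code, we are assigning `decomposed.plot()` to `x`. If you don’t do this assignment, the plot is shown twice in the jupyter notebook. If anyone knows why this is the case, let me know. Until I figure out why, I’ve just been doing it this way.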

The resulting plot is below.

Now we know for certain that we have a time series that has a trend (2nd panel from top) and has seasonality (third panel from top). Now what? Let’s make it stationary by removing/reducing trend and seasonality.

For the purposes of this particular example, I’m just going to use the quick/dirty method of differencing to get a more stationary series.

`sales_data['stationary']=sales_data['sales'].diff()`

Plotting this new set of data gets us the following plot.

Running `seasonal_decompose()` on this new data gives us:

From this new decomposed plot, we can see that there’s still some trend and even some seasonality, which is unfortunate because it means we’d need to take a look at other methods to truly remove trend and seasonality from this particular data series. For this example, though, I’m going to play dumb and say that it’s good enough and keep going (in reality, it might or might not be good enough).
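To see why a single round of differencing can leave seasonality behind, here is a small sketch on a made-up series. The `difference` helper is my own stand-in for what pandas’ `diff()` does (`diff()` also accepts a `periods` argument for the lag).

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series: linear trend plus a pattern repeating every 4 steps.
season = [0, 3, -2, 1]
toy = [2 * t + season[t % 4] for t in range(16)]

first_diff = difference(toy)            # trend is gone, the pattern remains
seasonal_diff = difference(toy, lag=4)  # the repeating pattern cancels out too

print(first_diff)
print(seasonal_diff)
```

The first difference still cycles through the same four values, while differencing at the length of the season leaves a constant. For monthly data, the analogous step would be a lag of 12, which is one of the ‘other methods’ you could try on the sales data.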

### Forecasting Time Series Data – Now on to the fun stuff!

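Alright, now that we know our data fits our assumptions (at least well enough for this example), we can build a model. For this, we’ll use the `AR()` model in the `statsmodels` library. I’m using this particular model because it auto-selects the lag value for modeling, which can simplify things. Note: this may not be the ideal approach, but it is a good approach when first starting this type of work. Also note that newer `statsmodels` releases have removed `AR()` in favor of `AutoReg()`, which expects the lags to be passed in explicitly, so the code below needs an older release.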

```from statsmodels.tsa.ar_model import AR
#create train/test datasets
X = sales_data['stationary'].dropna()
train_data = X[1:len(X)-12]
test_data = X[len(X)-12:]
#train the autoregression model
model = AR(train_data)
model_fitted = model.fit()
```

In the above, we are simply creating a testing and training dataset and then creating and fitting our `AR()` model. Once you’ve fit the model, you can look at the chosen lag and parameters of the model using some simple print statements.

```print('The lag value chose is: %s' % model_fitted.k_ar)
The lag value chose is: 10
print('The coefficients of the model are:\n %s' % model_fitted.params)
The coefficients of the model are:
const             7720.952626
L1.stationary       -1.297636
L2.stationary       -1.574980
L3.stationary       -1.403045
L4.stationary       -1.123204
L5.stationary       -0.472200
L6.stationary       -0.014586
L7.stationary        0.564099
L8.stationary        0.792080
L9.stationary        0.843242
L10.stationary       0.395546```

If we look back at our autocorrelation plot, we can see that the lag value of 10 is where the line first touches the 95% confidence level, which is usually the way you’d select the lag value when you first run autoregression models if you were selecting things manually, so the selection makes sense.

Now, let’s make some forecasts and see how they compare to actuals.

```
# make predictions
predictions = model_fitted.predict(
    start=len(train_data),
    end=len(train_data) + len(test_data) - 1,
    dynamic=False)

# create a comparison dataframe
compare_df = pd.concat(
    [sales_data['stationary'].tail(12), predictions],
    axis=1).rename(columns={'stationary': 'actual', 0: 'predicted'})

# plot the two values
compare_df.plot()
```

In this bit of code, we’ve made predictions and then combined the prediction values with the ‘test’ data from the `sales_data` dataframe.

That’s really not a bad model, as it shows trend and movements (highs/lows, etc.) well but doesn’t quite get the extreme values. Let’s check the fit with an R² score.

```from sklearn.metrics import r2_score
r2 = r2_score(sales_data['stationary'].tail(12), predictions)```

This gives us an R² value of `0.64` (`r2_score` returns the coefficient of determination, not a root mean square error), which isn’t terrible but there is room for improvement here.
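Since `r2_score` reports the coefficient of determination rather than an error in the units of the data, it can help to see the two measures side by side. This is a minimal pure-Python sketch with made-up numbers; the helper names are mine.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical actual/predicted values, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is easiest to read when you know the scale of the data. R² is unitless, with 1 meaning a perfect fit and 0 meaning no better than always predicting the mean.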

One thing to note about the `statsmodels` `AR()` model is that it is difficult to use in an ‘online’ fashion (e.g., train a model and then add new data points as they come in). You’d need to either retrain your model each time a new datapoint is added or save the coefficients from the model and compute your own predictions as needed.
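The ‘save the coefficients’ option can be sketched in a few lines. The `predict_next` helper below is hypothetical, and it assumes the coefficients are stored as a constant followed by the lag-1, lag-2, ... weights (the order in which the fitted parameters were printed earlier). The numbers are made up to keep the arithmetic easy.

```python
def predict_next(history, const, lag_coeffs):
    """One-step AR forecast: const + sum(b_k * y(t-k)) for k = 1..p."""
    p = len(lag_coeffs)
    if len(history) < p:
        raise ValueError("need at least %d past values" % p)
    recent = history[::-1][:p]  # recent[0] is y(t-1), recent[1] is y(t-2), ...
    return const + sum(b * y for b, y in zip(lag_coeffs, recent))

# Hypothetical coefficients and history, for illustration only.
const = 1.0
lag_coeffs = [0.5, 0.25]   # b1 applies to y(t-1), b2 to y(t-2)
history = [4.0, 8.0]       # oldest value first

# As each new point arrives, append it to history and forecast again.
history.append(predict_next(history, const, lag_coeffs))  # 1 + 0.5*8 + 0.25*4
print(history[-1])
```

Keep in mind that a model trained on the differenced series forecasts differences, so you would still need to add each forecast back onto the last known sales value to get a number on the original scale.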

I hope this has been a good introduction to forecasting time series data using autoregression in python. As always, if you have any questions or comments, leave them in the comment section or contact me.

Note: If you have some interest in learning more about determining stationarity and other methods for eliminating trend and seasonality beyond just differencing, let me know and I’ll put another post up that talks about those things in detail.

### Contact me / Hire me

If you’re working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science. If this is of interest, contact me.

## Local Interpretable Model-agnostic Explanations – LIME in Python

When working with classification and/or regression techniques, it’s always good to have the ability to ‘explain’ what your model is doing. Using Local Interpretable Model-agnostic Explanations (LIME), you now have the ability to quickly provide visual explanations of your model(s).

It’s quite easy to throw numbers or content into an algorithm and get a result that looks good. We can test for accuracy and feel confident that the classifier and/or model is ‘good’…but can we describe what the model is actually doing to other users? A good data scientist spends some of their time making sure they have reasonable explanations for what the model is doing and why the results are what they are.

There’s always been a focus on ‘trust’ in any type of modeling methodology but with machine learning and deep learning, many people feel like the black-box approach taken with these methods isn’t as trustworthy as other methods. This topic was addressed in a paper titled “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, which proposes the concept of Local Interpretable Model-agnostic Explanations (LIME). According to the paper, LIME is ‘an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.’
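To give a feel for what ‘approximating it locally with an interpretable model’ means, here is a bare-bones sketch of the idea: perturb one instance, ask the black box for predictions, weight the perturbed points by how close they are to the original, and fit a weighted linear model. This is my own simplification for illustration, not the algorithm as implemented in the LIME library (which does considerably more), and the function names, kernel and `black_box` model are all assumptions.

```python
import numpy as np

def explain_locally(predict_fn, instance, n_samples=500, scale=0.5, seed=0):
    """Fit a distance-weighted linear model to predict_fn around one instance."""
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    # 1. Perturb the instance with Gaussian noise.
    samples = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    # 2. Ask the black-box model for predictions on the perturbed points.
    preds = predict_fn(samples)
    # 3. Weight samples by closeness to the instance (Gaussian kernel).
    dists = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * scale ** 2))
    # 4. Weighted least squares: scale rows by sqrt(weight), then solve.
    design = np.hstack([np.ones((n_samples, 1)), samples - instance])
    sw = np.sqrt(weights)[:, None]
    coeffs, *_ = np.linalg.lstsq(design * sw, preds * sw[:, 0], rcond=None)
    return coeffs[0], coeffs[1:]  # local intercept, per-feature slopes

# Hypothetical black box: nonlinear in feature 0, linear in feature 1.
def black_box(x):
    return x[:, 0] ** 2 + 3.0 * x[:, 1]

intercept, slopes = explain_locally(black_box, [2.0, 1.0])
print(intercept, slopes)
```

The slopes are the ‘explanation’: near the point (2, 1), the black box behaves roughly like a linear model whose weights are close to the local gradient (about 4 for the first feature and 3 for the second), even though the model as a whole is not linear.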

I’ve used the LIME approach a few times in recent projects and really like the idea. It breaks down the modeling / classification techniques and output into a form that can be easily described to non-technical people.  That said, LIME isn’t a replacement for doing your job as a data scientist, but it is another tool to add to your toolbox.

To implement LIME in python, I use this LIME library written / released by one of the authors of the above paper.

I thought it might be good to provide a quick run-through of how to use this library. For this post, I’m going to mimic the “Using lime for regression” notebook the authors provide, but I’m going to provide a little more explanation.

The full notebook is available in my repo here.

### Getting started with Local Interpretable Model-agnostic Explanations (LIME)

Before you get started, you’ll need to install Lime.

`pip install lime`

Next, let’s import our required libraries.

```from sklearn.datasets import load_boston
import sklearn.ensemble
import numpy as np
from sklearn.model_selection import train_test_split
import lime
import lime.lime_tabular```

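Let’s load the sklearn dataset called ‘boston’. This dataset contains house prices and is often used for machine learning regression examples. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this example needs an older release.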

`boston = load_boston()`

Before we do much else, let’s take a look at the description of the dataset to get familiar with it.  You can do this by running the following command:

`print(boston['DESCR'])`

The output is:

```Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International
Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)```

Now that we have our data loaded, we want to build a regression model to forecast boston housing prices. We’ll use random forest for this to follow the example by the authors.

First, we’ll set up the RF Model and then create our training and test data using the `train_test_split` function from sklearn. Then, we’ll fit the data.

```rf = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)
train, test, labels_train, labels_test = train_test_split(boston.data, boston.target, train_size=0.80)
rf.fit(train, labels_train)```

Now that we have a Random Forest Regressor trained, we can check some of the accuracy measures.

`print('Random Forest MSError', np.mean((rf.predict(test) - labels_test) ** 2))`

The MSError is: 10.45. Now, let’s look at the MSError when predicting the mean.

`print('MSError when predicting the mean', np.mean((labels_train.mean() - labels_test) ** 2))`

From this, we get 80.09.

Without really knowing the dataset, it’s hard to say whether they are good or bad. Since we are really most interested in looking at the LIME approach, we’ll move along and assume these are decent errors.
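One quick way to put the two numbers in context is to compare them with each other. The sketch below uses only the two values reported above (they come from one random train/test split, so yours will differ).

```python
# Values reported above for one random train/test split.
model_mse = 10.45      # random forest
baseline_mse = 80.09   # always predicting the training mean

# Fraction of the baseline's squared error that the model removes.
error_reduction = 1 - model_mse / baseline_mse
# Square roots put both errors back in the target's units ($1000s).
model_rmse = model_mse ** 0.5
baseline_rmse = baseline_mse ** 0.5

print(round(error_reduction, 3))
print(round(model_rmse, 2), round(baseline_rmse, 2))
```

In other words, the random forest removes most of the squared error you would get from simply guessing the average price, which is a reasonable sanity check before moving on to the explanations.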

To implement LIME, we need to get the categorical features from our data and then build an ‘explainer’. This is done with the following commands:

```
# Treat any column with 10 or fewer distinct values as categorical.
categorical_features = np.argwhere(
    np.array([len(set(boston.data[:, x]))
              for x in range(boston.data.shape[1])]) <= 10).flatten()
```

and the explainer:

```
explainer = lime.lime_tabular.LimeTabularExplainer(
    train,
    feature_names=boston.feature_names,
    class_names=['price'],
    categorical_features=categorical_features,
    verbose=True,
    mode='regression')
```
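The categorical-feature detection above is just a count of distinct values per column. Here is the same idea on a small made-up array (the `toy_data` name and its columns are assumptions for illustration), which makes it easier to see what ends up in the index list.

```python
import numpy as np

# Hypothetical feature matrix: column 1 is a 0/1 flag, the others are continuous.
toy_data = np.column_stack([
    np.linspace(0.0, 1.0, 50),   # 50 distinct values
    np.arange(50) % 2,           # 2 distinct values
    np.arange(50) ** 2,          # 50 distinct values
])

# Count distinct values per column and keep the columns with 10 or fewer.
distinct_counts = np.array(
    [len(np.unique(toy_data[:, col])) for col in range(toy_data.shape[1])]
)
toy_categorical = np.argwhere(distinct_counts <= 10).flatten()

print(distinct_counts)
print(toy_categorical)
```

The threshold of 10 is a rule of thumb carried over from the example, so it is worth checking the selected columns against the dataset description rather than trusting the cutoff blindly.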

Now, we can grab one of our test values and check out our prediction(s). Here, we’ll grab the 100th test value and check the prediction and see what the explainer has to say about it.

```i = 100
exp = explainer.explain_instance(test[i], rf.predict, num_features=5)
exp.show_in_notebook(show_table=True)```

So…what does this tell us?

It tells us that the 100th test value’s prediction is 21.16 with the “RAD=24” value providing the most positive valuation and the other features providing negative valuation in the prediction.

For regression, this isn’t quite as interesting (although it is useful). The LIME approach shows much more benefit (at least to me) when performing classification.

As an example, if you are trying to classify plants as edible or poisonous, LIME’s explanation is much more useful. Here’s an example from the authors.

Take a look at LIME when you have some time. It’s a good library to add to your toolkit, especially if you are doing a lot of classification work. It makes it much easier to ‘explain’ what the model is doing.

## Installing python on Windows

Note: Enthought Canopy is End-of-Life. Rather than re-write this piece, I’ll just point readers to the Enthought End-of-Life note for more information on how to move to the new supported version(s). When time permits, I’ll write up another post describing the installation process.

If you’ve done any work with python on Windows, you may be cringing right now at the thought of trying to do any type of python development work on the platform.  Have no fear though…there is hope for python developers on Windows, especially if you are only going to be using python for data analysis, machine learning, etc and not doing any major web development work (with flask, django, etc). In this post, I describe the steps necessary for installing python on Windows.

There’s really only one method for using / installing python on Windows that is convenient and works for 99.9% of the people on Windows who are focused on scientific computing: downloading Enthought Canopy or Anaconda and installing it. For those of you getting started with data analytics, Canopy gets you started faster and makes it very easy to get modules like pandas, numpy, scipy, etc installed and configured (in most cases, these are already installed when Canopy is installed).

For those of you running on Mac or Linux, you can also install Canopy for your platforms. I personally don’t use Canopy on the Mac or Linux platform, but only because I prefer to manage things a bit differently on those platforms. There’s nothing wrong with using Canopy on Mac or Linux, I just prefer not to.

### Installing Python on Windows using Canopy

For the purposes of this post, we are going to install Canopy (accurate as of November 2016).

• Step 1 – visit the Enthought Canopy website and click the “Get Canopy” button.
• Click the “Download Canopy” button. A web form will pop up asking for information…you can ignore that. Your download has started. Note: Canopy is available in 64-bit and 32-bit versions. I recommend the 64-bit if you are on a modern computer / operating system.
• Once your download completes, run the executable to begin the installation process. A wizard will be displayed…hit “next” through the wizard and install the software. Once installation is complete, the final screen (see below) will have a “finish” button and a “Launch Canopy when setup exits” checkbox. Leave the checkbox selected and click “finish” to complete the installation and launch Canopy.
• The first time you run Canopy, you will be presented with an ‘environment’ window (see below). You can leave this at the default or select another location to store your environment information. I suggest leaving it at default to begin with. Click “Continue” to begin using Canopy.
• The first time you load Canopy, it will take some time to load the various modules into memory and set up your Canopy environment. Each time after this first start, the platform should load up fairly quickly.
• Once Canopy completes loading your environment for the first time, you’ll be asked if you want to make Canopy your default Python environment. Select “Yes” and click “Start using Canopy”. If you select “no”, you will have to do some manual configuration to begin using Canopy.
• When Canopy starts, you’ll see the window below.
• You now have Canopy installed and ready to go. To start programming, click the ‘Editor’ button and Canopy will load up an editor so you can begin work. Below is a screenshot of the editor window.

Check out the other posts on this website for more information on how to get started actually DOING something with python.