## Forecasting Time Series Data using Autoregression

This is (yet) another post on forecasting time series data (you can find all the forecasting posts here).  In this post, we are going to talk about Autoregression models and how you might be able to apply them to forecasting time series problems.

Before we get into the forecasting time series , let’s talk a bit about autoregression models as well as some of the steps you need to take before you dive into using them when using them in forecasting time series data. You can jump over to view my jupyter notebook (simplified without comments) here.

### Autoregression vs Linear Regression

Autoregression modeling is a modeling technique used for time series data that assumes linear continuation of the series so that previous values in the time series can be used to predict futures values.  Some of you may be thinking that this sounds just like a linear regression – it sure does sound that way and is – in general – the same idea with additional features of the model that includes the idea of ‘lag variables’.

With a linear regression model, you’re taking all of the previous data points to build a model to predict a future data point using a simple linear model. The simple linear regression model is explained in much more detail here. An example of a linear model can be found below:

where a and b are variables found during the optimization/training process of the linear model.

With the autoregression model, your’e using previous data points and using them to predict future data point(s) but with multiple lag variables. Autocorrelation and autoregression are discussed in more detail here. An example of an autoregression model can be found below:

where a, b1, b2 and b3 are variables found during the training of the model and X(t-1), X(t-2) and X(t-3) are input variables at previous times within the data set.

The above is not nearly enough statistical background to truly understand linear and autoregression models, but I hope it gets you some basic understanding of how the two approaches differ.  Now, let’s dig into how to implement this with python.

### Forecasting Time Series with Autoregression

For this type of modeling, you need to be aware of the assumptions that are made prior to beginning working with data and autoregression modeling.

Assumptions:

• The previous time step(s) is useful in predicting the value at the next time step (dependance between values)
• Your data is stationary. A time series is stationary if is mean (and/or variance) is constant over time. There are other statistical properties to look at as well, but looking at the mean is usually the fastest/easiest.

If your time series data isn’t stationary, you’ll need to make it that way with some form of trend and seasonality removal (we’ll talk about that shortly).   If your time series data values are independent of each other, autoregression isn’t going to be a good forecasting method for that series.

Lets get into some code and some actual ‘doing’ rather than ‘talking’.

For this example, I’m going to use the retail sales data that I’ve used in the past.  Let’s load the data and take a look at the plot.

Nothing fancy here…just simple pandas loading and plotting (after the standard imports for this type of thing).

The plot looks like the following:

Let’s check for dependance (aka, correlation) – which is the first assumption for autoregression models. A visual method for checking correlation is to use pandas `lag_plot()` function to see how well the values of the original sales data are correlated with each other. If they are highly correlated, we’ll see a fairly close grouping of datapoints that align along some point/line on the plot.

Because we don’t have many data points, this particular `lag_plot()` doesn’t look terribly convincing, but there is some correlation in there (along with some possible outliers).

A great example of correlated values can be seen in the below `lag_plot()` chart. These are taken from another project I’m working on (and might write up in another post).

Like good data scientists/statisticians, we don’t want to just rely on a visual representation of correlation though, so we’ll use the idea of autocorrelation plots to look at correlations of our data.

Using pandas, you can plot an autocorrelation plot using this command:

The resulting chart contains a few lines on it separate from the autocorrelation function. The dark horizontal line at zero just denotes the zero line, the lighter full horizontal lines is the 95% confidence level and the dashed horizontal lines are 99% confidence levels, which means that correlations are more significant if they occur at those levels.

From the plot above, we can see there’s some significant correlation between t=1 and t=12 (roughly) with significant decline in correlation after that timeframe.  Since we are looking at monthly sales data, this seems to make sense with correlations falling off at the start of the new fiscal year.

We can test this concept by checking the pearson correlation of the sales data with lagged values using the approach below.

We used ’12’ above because that looked to be the highest correlation value from the autocorrelation chart. The output of the above command gives us a correlation value of `0.97` which is quite high (and actually almost too high for my liking, but it is what it is).

Now, let’s take a look at stationarity.  I can tell you just from looking at that chart that we have a non-stationary dataset due to the increasing trend from lower left to upper right as well as some seasonality (you can see large spikes at roughly the same time within each year).  There are plenty of tests that you can do to determine if seasonality / trend exist a time series, but for the purpose of this example, I’m going to do a quick/dirty plot to see trend/seasonality using the `seasonal_decompose()` method found in the `statsmodels` library.

Note: In the above code, we are assigning `decomposed.plot()` to `x`. If you don’t do this assignment, the plot is shown in the jupyter notebook. If anyone knows why this is the case, let me know. Until I figure out why, I’ve just been doing it this way.

The resulting plot is below.

Now we know for certain that we have a time series that has a trend (2nd panel from top) and has seasonality (third panel from top).  Now what?  Let’s make it stationary by removing/reducing trend and seasonality.

For the purposes of this particular example, I’m just going to use the quick/dirty method of differencing to get a more stationary model.

Plotting this new set of data gets us the following plot.

Running `seasonal_decompose()` on this new data gives us:

From this new decomposed plot, we can see that there’s still some trend and even some seasonality, which is unfortunate because it means we’d need to take a look at other methods to truly remove trend and seasonality from this particular data series, but for this example, I’m going to play dumb and say that its good enough and keep going (and in reality, it might be good enough — or it might not be good enough).

### Forecasting Time Series Data – Now on to the fun stuff!

Alright – now that we know our data fits our assumptions, at least well enough for this example. For this, we’ll use the `AR()` model in `statsmodels` library. I’m using this particular model becasue it auto-selects the lag value for modeling, which can simplify things. Note: this may not be the ideal approach, but is a good approach when first starting this type of work.

In the above, we are simply creating a testing and training dataset and then creating and fitting our `AR()` model. Once you’ve fit the model, you can look at the chosen lag and parameters of the model using some simple print statements.

If we look back at our autocorrelation plot, we can see that the lag value of 10 is where the line first touches the 95% confidence level, which is usually the way you’d select the lag value when you first run autoregression models if you were selecting things manually, so the selection makes sense.

Now, let’s make some forecasts and see how they compare to actuals.

In this bit of code, we’ve made predictions and then combined the prediction values with the ‘test’ data from the `sales_data` dataframe.

That’s really not a bad model at it shows trend and movements (high/lows, etc) well but doesn’t quite get the extreme values.   Let’s check our root mean square error.

This gives us a root mean square value of `0.64`, which isn’t terrible but there is room for improvement here.

One thing to note about `statsmodels AR()` libary is that it makes it difficult to use this in on ‘online’ fashion (e.g., train a model and then add new data points as they come in). You’d need to either retrain your model based on the new datapoint added or just save the coefficients from the model and predict your own values as needed.

I hope this has been a good introduction of forecasting time series data using autoregression in python. A always, if you have any questions or comments, leave them in the comment section or contact me.

Note: If you have some interest in learning more about determining stationarity and other methods for eliminating trend and seasonality beyond just differencing, let me know and i’ll put another post up that talks about those things in detail.

### Contact me / Hire me

If you’re working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science….if this is of interested, contact me.

## Quick Tip: Consuming Google Search results to use for web scraping

While working on a project recently, I needed to grab some google search results for specific search phrases and then scrape the content from the page results.

For example, when searching for a Sony 16-35mm f2.8 GM lens on google, I wanted to grab some content (reviews, text, etc) from the results.  While this isn’t hard to build from scratch, I ran across a couple of libraries that are easy to use and make things so much easier.

The first is ‘Google Search‘ (install via `pip install google`). This library lets you consume google search results with just one line of code. An example is below (this will import google search and run a search for `Sony 16-35mm f2.8 GM lens` and print out the urls for the search.

For the above, I’m using `google.com` for the search and have told it to stop after the first set of results.

The output:

That’s pretty easy.

Now, we can use those url’s to scrape the websites that are returned.

To scrape these sites, you could run some fairly complex scraping systems, build your own fairly complex systems…or…if you just need some basic content and aren’t going to be doing a LOT of scraping, you could use the ‘Newspaper‘ library. Of course, there are plenty of other libraries but the newspaper library really simplifies things for those ‘quick and dirty’ projects.  Note: This is best used in python3.

To get started, install newspaper with `pip3 install newspaper3k` (for python3).

Now, to scrape the urls returned from the google search, you can simply do the following:

This will grab the url, download it and parse it so you can access the content.  Here’s an example of grabbing the url `https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx.`

The output of the `print(article.text` is below (I’ve only included an excerpt for this example but this will grab the entire text):

‘Those putting together the ultimate Sony E-mount lens kit are going to want this lens included. The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range in wide aperture with high quality. In this case, the term high quality applies both to the lens\’ physical attributes and to the image quality delivered by it.\n\nMany are first-attracted to the Alpha MILC (Mirrorless Interchangeable Lens Camera) system for Sony\’s high-performing full frame imaging sensors, but lenses are as important as cameras and Sony\’s lens lineup was initially viewed by many as deficient. Adapting Canon brand lenses for use on Sony cameras was prevalent. The introduction of Sony\’s flagship Grand Master line (the “GM” in the name) was very welcomed by Sony owners and this line is proving attractive to those considering a switch to the Sony camp. The 16-35mm f/2.8 GM is one more reason to stay entirely within the Sony brand.\n\nFocal Length Range\n\nWhen starting a kit, most will first select a general purpose lens (Sony system owners should seriously consider the Sony FE 24-70mm f/2.8 GM Lens) and one of the next-most-needed lenses is typically a wide-angle zoom. This 16-35mm range ideally covers that need.\n\nThe 107° angle of view provided by a 16mm focal length is ultra-wide and all of the narrower angles of view down to 63°, just modestly-wide, are included. To explore what this focal length range looks like, we head to RB Rickett\’s falls in Ricketts Glen State Park.\n\nOne of the most popular uses for this range is, as illustrated above, landscape photography.

Now, one of the really cool features of the `newspaper` library is that it has built-in natural language processing capabilities and can return keywords, summaries and other interesting tidbits. To get this to work, you must have the Natural Language Toolkit (NLTK) installed (install with `pip install nltk`) and have the `punkt` package installed from nltk. Here’s an example using the previous url (and assuming you’ve already done the above steps).

The result:

That’s quite nice (and easy!).  Of course, If I were doing this as a serious NLP Project, i’d write my own NLP functions but for a quick look at keywords of an article, this is a fast way to do it.

If you want to learn more about Natural Language Processing using NLTK, the definitive book is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.

Photo by Émile Perron on Unsplash

## Quick Tip: Comparing two pandas dataframes and getting the differences

There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e.,g Comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently its still a popular ‘question’ for many people and is my most popular question on stackoverflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

dataframe 2 (named df2):

The answer, it seems, is quite simple – but I couldn’t figure it out at the time.  Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on stackoverlow):