Category: data analytics

Forecasting Time Series data with Prophet – Part 4

This is the fourth in a series of posts about using Forecasting Time Series data with Prophet. The other parts can be found here:

In those previous posts, I looked at forecasting monthly sales data 24 months into the future using some example sales data that you can find here.

In this post, I want to look at the output of Prophet to see how we can apply some metrics to measure ‘accuracy’.  When we start looking at ‘accuracy’ of forecasts, we can really do a whole lot of harm by using the wrong metrics and the wrong data to measure accuracy.  That said, its good practice to always try to compare your predicted values with your actual values to see how well or poorly your model(s) are performing.

For the purposes of this post, I’m going to expand on the data in the previous posts. For this post we are using fbprophet version 0.2.1.  Also – we’ll need scikit-learn and scipy installed for looking at some metrics.

Note: While I’m using Prophet to generate the models, these metrics and tests for accuracy can be used with just about any modeling approach.

Since the majority of the work has been covered in Part 3, I’m going to skip down to the metrics section…you can see the entire code and follow along with the jupyter notebook here.

In the notebook, we’ve loaded the data. The visualization of the data looks like this:

sales monthly data

Our prophet model forecast looks like:

sales monthly data forecast

Again…you can see all the steps in thejupyter notebook if you want to follow along step by step.

Now that we have a prophet forecast for this data, let’s combine the forecast with our original data so we can compare the two data sets.

The above line of code takes the actual forecast data ‘yhat’ in the forecast dataframe, sets the index to be ‘ds’ on both (to allow us to combine with the original data-set) and then joins these forecasts with the original data. lastly, we reset the indexes to get back to the non-date index that we’ve been working with (this isn’t necessary…just a step I took).

The new dataframe looks like this:

combined dataframe

You can see from the above, that the last part of the dataframe has “NaN” for ‘y’…that’s fine because we are only concerned about checking the forecast values versus the actual values so we can drop these “NaN” values.

Now, we have a dataframe with just the original data (in the ‘y’ column) and forecasted data (in the yhat column) to compare.

Now, we are going to take a look at a few metrics.

Metrics for measuring modeling accuracy

If you ask 100 different statisticians, you’ll probably get at least 50 different answers on ‘the best’ metrics to use for measuring accuracy of models.  For most cases, using either R-Squared, Mean Squared Error and Mean Absolute Error (or a combo of them all) will get you a good enough measure of the accuracy of your model.

For me, I like to use R-Squared and Mean Absolute Error (MAE).  With these two measures, I feel like I can get a really good feel for how well (or poorly) my model is doing.

Python’s ScitKit Learn has some good / easy methods for calculating these values.  To use them, you’ll need to import them (and have scitkit-learn and scipy installed). If you don’t have scitkit-learn and scipy installed, you can do so with the following command:

Now, you can import the metrics with the following command:

To calculate R-Squared, we simply do the following:

For this data, we get an R-Squared value of 0.99.   Now…this is an amazing value…it can be interpreted to mean that 99% of the variance in this data is explained by the model. Pretty darn good (but also very very naive in thinking). When I see an R-Squared value like this, I immediately think that the model has been overfit.   If you want to dig into a good read on what R-Squared means and how to interpret it, check out this post.

Now, let’s take a look at MSE.

The MSE turns out to be 11,129,529.44. That’s a huge value…an MSE of 11 million tells me this model isn’t all that great, which isn’t surprising given the low number of data points used to build the model.  That said, a high MSE isn’t a bad thing necessarily but it give you a good feel for the accuracy you can expect to see.

Lastly, let’s take a look at MAE.

For this model / data, the MAE turns out to be 2,601.15, which really isn’t all that bad. What that tells me is that for each data point, my average magnitude of error is roughly $2,600, which isn’t all that bad when we are looking at sales values in the $300K to $500K range.  BTW – if you want to take a look at an interesting comparison of MAE and RMSE (Root Mean Squared Error), check out this post.

Hopefully this has been helpful.  It wasn’t the intention of this post to explain the intricacies of these metrics, but hopefully you’ve seen a bit about how to use metrics to measure your models. I may go into more detail on modeling / forecasting accuracies in the future at some point. Let me know if you have any questions on this stuff…I’d be happy to expand if needed.

Note: In the jupyter notebook,  I show the use of a new metrics library I found called ML-Metrics. Check it out…its another way to run some of the metrics.

If you want to learn more about time series forecating, here’s a few good books on the subject. These are Amazon links…I’d appreciate it if you used them if you purchase these books as the little bit of income that comes from these links helps pay for the server this blog runs on.


Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and other various types of analysis.

In order to do this type of analysis, you first need to have some data to analyze.  You can also scrape websites like Twitter or Facebook using simple web scrapers, but I’ve always found it easier to use the API’s that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but its been a while since I’ve looked at the Search API).   I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your required keywords and then store those tweets somewhere for later analysis.

There are tons of ways to connect up to the Streaming API. There are also quite a few Twitter API wrappers for Python (and most of them work very well).   I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB simply because MySQL fits into my ecosystem of backup scripts, etc better than MongoDB does.  MongoDB is perfectly suited for this type of work for larger projects…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces/parts of this script but I don’t want to discount the help I had from other people / sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:

Next, set up your mongoDB path:

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases seperated by commas.

Here, I’m listening for words related to maching learning, data science, etc.

Next, let’s set up our Twitter API Access information.  You can set these up here.

Time to build the listener class.

Now that we have the listener class, let’s set everything up to start listening.

Now you are ready to go. The full script is below. You can store this script as “” and run it as “python” and – assuming you set up mongoDB and your twitter API key’s correctly, you should start collecting Tweets.

The Full Script:


Dask – A better way to work with large CSV files in Python

Dask dataframeIn a recent post titled Working with Large CSV files in Python, I shared an approach I use when I have very large CSV files (and other file types) that are too large to load into memory. While the approach I previously highlighted works well, it can be tedious to first load data into sqllite (or any other database) and then access that database to analyze data.   I just found a better approach using Dask.

While looking around the web to learn about some parallel processing capabilities, I ran across a python module named Dask, which describes itself as:

…is a flexible parallel computing library for analytic computing.

When I saw that, I was intrigued. There’s a lot that can be done with that statement  and I’ve got plans to introduce Dask into my various tool sets for data analytics.

While reading the docs, I ran across the ‘dataframe‘ concept and immediately new I’d found a new tool for working with large CSV files.  With Dask’s dataframe concept,  you can do out-of-core analysis (e.g., analyze data in the CSV without loading the entire CSV file into memory). Other than out-of-core manipulation, dask’s dataframe uses the pandas API, which makes things extremely easy for those of us who use and love pandas.

With Dask and its dataframe construct, you set up the dataframe must like you would in pandas but rather than loading the data into pandas, this appraoch keeps the dataframe as a sort of ‘pointer’ to the data file and doesn’t load anything until you specifically tell it to do so.

One note (that I always have to share):  If you are planning on working with your data set over time, its probably best to get the data into a database of some type.

An example using Dask and the Dataframe

First, let’s get everything installed. The documentation claims that you just need to install dask, but I had to install ‘toolz’ and ‘cloudpickle’ to get dask’s dataframe to import.  To install dask and its requirements, open a terminal and type (you need pip for this):

NOTE: I mistakenly had “pip install dask” listed initially. This only installs the base dask system and not the dataframe (and other dependancies). Thanks to Kevin for pointing this out.

Now, let’s write some code to load csv data and and start analyzing it. For this example, I’m using the 311 Service Requests dataset from NYC’s Open Data portal.   You can download the dataset here: 311 Service Requests – 7Gb+ CSV

Set up your dataframe so you can analyze the 311_Service_Requests.csv file. This file is assumed to be stored in the directory that you are working in.

Unlike pandas, the data isn’t read into memory…we’ve just set up the dataframe to be ready to do some compute functions on the data in the csv file using familiar functions from pandas. Note: I used “dtype=’str'” in the read_csv to get around some strange formatting issues in this particular file.

Let’s take a look at the first few rows of the file using pandas’ head() call.  When you run this, the first X rows (however many rows you are looking at with head(X)) and then displays those rows.

Note: a small subset of the columns are shown below for simplicity

Unique KeyCreated DateClosed DateAgency
2551348105/09/2013 12:00:00 AM05/14/2013 12:00:00 AMHPD
2551348205/09/2013 12:00:00 AM05/13/2013 12:00:00 AMHPD
2551348305/09/2013 12:00:00 AM05/22/2013 12:00:00 AMHPD
2551348405/09/2013 12:00:00 AM05/12/2013 12:00:00 AMHPD
2551348505/09/2013 12:00:00 AM05/11/2013 12:00:00 AMHPD

We see that there’s some spaces in the column names. Let’s remove those spaces to make things easier to work with.

The cool thing about dask is that you can do things like renaming columns without loading all the data into memory.

There’s a column in this data called ‘Descriptor’ that has the problem types, and “radiator” is one of those problem types. Let’s take a look at how many service requests were because of some problem with a radiator.  To do this, you can filter the dataframe using standard pandas filtering (see below) to create a new dataframe.

Let’s see how many rows we have using the ‘count’ command

You’ll notice that when you run the above command, you don’t actually get count returned. You get a descriptor back similar  like “dd.Scalar<series-…, dtype=int64>

To actually compute the count, you have to call “compute” to get dask to run through the dataframe and count the number of records.

When you run this command, you should get something like the following

The above are just some samples for using dask’s dataframe construct.  Remember, we built a new dataframe using pandas’ filters without loading the entire original data set into memory.  They may not seem like much, but when working with a 7Gb+ file, you can save a great deal of time and effort using dask when compared to using the approach I previously mentioned.

Dask seems to have a ton of other great features that I’ll be diving into at some point in the near future, but for now, the dataframe construct has been an awesome find.