Forecasting Time-Series data with Prophet – Part 1

This is part 1 of a series where I look at using Prophet for Time-Series forecasting in Python

A lot of what I do in my data analytics work is understanding time series data, modeling that data, and trying to forecast what might come next in that data. Over the years I've used many different approaches, libraries, and modeling techniques for modeling and forecasting, with some success…and a lot of failure.

Recently, I've been looking for a simpler approach for my initial modeling and think I've found a very nice library in Facebook's Prophet (available for both Python and R). While this particular library isn't terribly robust, it is quick and gives some very good results for that initial pass at modeling / forecasting time series data. An added bonus for those who like to understand the theory behind things is this white paper, which gives a very good description of the math / statistical approach behind Prophet.


If you are interested in learning more about time-series forecasting, check out the books / websites below.


Installing Prophet

To get started with Prophet, you’ll first need to install it (of course).

Installation instructions can be found here, but it should be as easy as doing the following (if you have an existing system that has the proper compilers installed):
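```
# the package was published on PyPI as fbprophet when this was written
pip install fbprophet
```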

For those running conda, you can install prophet via conda-forge using the following command:
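```
# the conda-forge package is also named fbprophet
conda install -c conda-forge fbprophet
```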

Note: Prophet requires pystan, so you may need to also do the following (although in my case, it was installed as a requirement of fbprophet):
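```
pip install pystan
```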

Pystan documentation can be found here.

Getting started

Using Prophet is extremely straightforward. You import it, load some data into a pandas dataframe, set the data up into the proper format and then start modeling / forecasting.

First, import the module (plus some other modules that we’ll need):
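```python
# note: at the time of writing, the package was published as fbprophet
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fbprophet import Prophet
```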

Now, let's load up some data. For this example, I'm going to be using the retail sales example csv file found on github.
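Something like this works (a sketch; the file path and the original column names are assumptions, so adjust them to match the copy of the csv you downloaded):

```python
# load the retail sales data and rename the columns to what Prophet expects
sales_df = pd.read_csv('retail_sales.csv')
sales_df = sales_df.rename(columns={'date': 'ds', 'sales': 'y'})
sales_df['ds'] = pd.to_datetime(sales_df['ds'])
sales_df.head()
```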

Now, we have a pandas dataframe with our data that looks something like this:

Prophet pandas dataframe example

Note the format of the dataframe. This is the format that Prophet expects to see. There needs to be a 'ds' column that contains the datetime field and a 'y' column that contains the value we want to model/forecast.

Before we can do any analysis with this data, we need to log-transform the 'y' variable to try to convert non-stationary data to stationary data. This also converts trends to more linear trends (see this website for more info). This isn't always a perfect way to handle time-series data, but it works often enough that it can be tried initially without much worry.

To log-transform the data, we can use np.log() on the 'y' column like this:
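```python
# keep a copy of the original values in 'y_orig' so we can plot them later
sales_df['y_orig'] = sales_df['y']
# log-transform 'y'
sales_df['y'] = np.log(sales_df['y'])
```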

Your dataframe should now look like the following:

log transformed data for Prophet

It's time to start the modeling. You can do this easily with the following commands:
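```python
model = Prophet()
model.fit(sales_df)
```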

If you are running with monthly data, you'll most likely see an informational message from Prophet after you run the above commands, noting that weekly (and daily) seasonality has been disabled because the data doesn't support it.

You can safely ignore this message since we are running monthly data.

Now it's time to start forecasting. With Prophet, you start by building some future time data with the following command:
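```python
# 6 future data points at a monthly frequency
future_data = model.make_future_dataframe(periods=6, freq='m')
```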

In this line of code, we are creating a pandas dataframe with 6 (periods = 6) future data points at a monthly frequency (freq = 'm'). If you're working with daily data, you wouldn't want to include freq='m'.

Now we forecast using the ‘predict’ command:
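```python
forecast_data = model.predict(future_data)
```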

If you take a quick look at the data using .head() or .tail(), you’ll notice there are a lot of columns in the forecast_data dataframe. The important ones (for now) are ‘ds’ (datetime), ‘yhat’ (forecast), ‘yhat_lower’ and ‘yhat_upper’ (uncertainty levels).

You can view only these columns in a .tail() by running the following command.
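```python
forecast_data[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
```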

Your dataframe should look like:

prophet forecasted data

Let’s take a look at a graph of this data to get an understanding of how well our model is working.
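Prophet can draw this chart for us with its built-in plot method:

```python
model.plot(forecast_data)
```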

fbprophet forecast graph

That looks pretty good. Now, let's take a look at the seasonality and trend components of our data/model/forecast.
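Prophet has a built-in method for this as well:

```python
model.plot_components(forecast_data)
```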

Prophet component plot for seasonality and trend

Since we are working with monthly data, Prophet will plot the trend and the yearly seasonality but if you were working with daily data, you would also see a weekly seasonality plot included.

From the trend and seasonality, we can see that the trend is playing a large part in the underlying time series, and seasonality comes into play more toward the beginning and the end of the year.

So far so good.  With the above info, we’ve been able to quickly model and forecast some data to get a feel for what might be coming our way in the future from this particular data set.

Before we go on to tweaking this model (which I’ll talk about in my next post), I wanted to share a little tip for getting your forecast plot to display your ‘original’ data so you can see the forecast in ‘context’ and in the original scale rather than the log-transformed data. You can do this by using np.exp() to get our original data back.
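Here's a sketch of that conversion (I'm copying the forecast into a new forecast_data_orig dataframe; the name is just my choice):

```python
forecast_data_orig = forecast_data.copy()  # keep the log-scale forecast intact
forecast_data_orig['yhat'] = np.exp(forecast_data_orig['yhat'])
forecast_data_orig['yhat_lower'] = np.exp(forecast_data_orig['yhat_lower'])
forecast_data_orig['yhat_upper'] = np.exp(forecast_data_orig['yhat_upper'])
```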

Let’s take a look at the forecast with the original data:
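```python
model.plot(forecast_data_orig)
```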

fbprophet forecast data with original data

Something looks wrong (and it is)!

Our original data is drawn on the forecast, but the black dots (the dark line at the bottom of the chart) are our log-transformed original 'y' data. For this to make any sense, we need to get our original 'y' data points plotted on this chart. To do this, we just need to rename our 'y_orig' column in the sales_df dataframe to 'y' to have the right data plotted. Be careful here…you want to make sure you don't continue analyzing data with the non-log-transformed data.
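A sketch of that renaming (stashing the log-transformed values in a 'y_log' column so they aren't lost):

```python
sales_df['y_log'] = sales_df['y']   # hold on to the log-transformed values for any further analysis
sales_df['y'] = sales_df['y_orig']  # put the original-scale values back into 'y' for plotting
```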

And…plot it.
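One simple way to draw the corrected chart is to plot the original-scale data and the back-transformed forecast directly with matplotlib (a sketch, reusing the column names from the snippets above):

```python
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(sales_df['ds'], sales_df['y'], 'k.', label='observed sales')
ax.plot(forecast_data_orig['ds'], forecast_data_orig['yhat'], 'b-', label='forecast')
ax.fill_between(forecast_data_orig['ds'],
                forecast_data_orig['yhat_lower'],
                forecast_data_orig['yhat_upper'],
                color='b', alpha=0.2, label='uncertainty interval')
ax.legend()
plt.show()
```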

fbprophet original corrected

There we go…a forecast for retail sales 6 months into the future (you have to look closely at the very far right-hand side for the forecast). It looks like the next six months will see sales between 450K and 475K.


Check back soon for my next post on using Prophet for forecasting time-series data where I talk about how to tweak the models that come out of prophet.


Collecting / Storing Tweets with Python and MySQL

A few days ago, I published Collecting / Storing Tweets with Python and MongoDB. In that post, I describe the steps needed to collect and store tweets gathered via the Twitter Streaming API.

I received a comment on that post asking how to store data into MySQL instead of MongoDB. Here’s what you’d need to do to make that change.

Collecting / Storing Tweets with Python and MySQL

In the previous post, we stored the twitter data into MongoDB with one line of code:
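```python
db.twitter_search.insert(datajson)  # the collection name here is illustrative
```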

Unlike MongoDB where we can insert the entire json object, if we want to use MySQL instead of MongoDB, we need to do some additional work with the ‘datajson’ object before storing it.

Let’s assume that we are interested in just capturing the username, date, Tweet and Tweet ID from twitter.  This is most likely the bare minimum information you’d want to capture from the API…there are many (many) more fields available but for now, we’ll go with these four.

Note: I’ll hold off on the MySQL specific changes for now and touch on them shortly.

Once you capture the tweet (line 38 in my original script) and have it stored in your datajson object, create a few variables to store the date, username, Tweet and ID.
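Something like the following works (the variable names are my own; the keys are standard fields in the tweet payload returned by the Streaming API):

```python
from dateutil import parser

text = datajson['text']
screen_name = datajson['user']['screen_name']
tweet_id = datajson['id']
created_at = parser.parse(datajson['created_at'])  # convert Twitter's date string into a datetime
```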

Note in the above that we are using the parser.parse() command from the dateutil module to parse the created_at date for storage into MySQL.

Now that we have our variables ready to go, we'll need to store those variables into MySQL. Before we can do that, we need to set up a MySQL connection. I use the python-mysql connector, but you are free to use what you need to. You'll need to do an import MySQLdb to get the connector imported into your script (assuming you installed the connector with pip install mysql-python).

You’ll need to create a table to store this data. You can use the sql statement below to do so if you need assistance / guidance.
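Here's a sketch of one way to do it; the table and column names (and the connection details) are my own, so adjust them to fit your environment:

```python
import MySQLdb

create_table_sql = """
    CREATE TABLE IF NOT EXISTS twitter (
        id INT AUTO_INCREMENT PRIMARY KEY,
        tweet_id VARCHAR(250),
        screen_name VARCHAR(128),
        created_at DATETIME,
        text TEXT
    )
"""

conn = MySQLdb.connect(host="localhost", user="your_user",
                       passwd="your_password", db="twitter_db", charset="utf8")
cursor = conn.cursor()
cursor.execute(create_table_sql)
conn.commit()
conn.close()
```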

Now, let’s set up our MySQL connection, query and execute/commit for the script. I’ll use a function for this to be able to re-use it for each tweet captured.
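A sketch of that function (the connection details are placeholders; the table and columns match the CREATE TABLE statement above):

```python
def store_data(created_at, text, screen_name, tweet_id):
    # open a connection, insert one row, then clean up
    db = MySQLdb.connect(host="localhost", user="your_user",
                         passwd="your_password", db="twitter_db", charset="utf8")
    cursor = db.cursor()
    insert_query = ("INSERT INTO twitter (tweet_id, screen_name, created_at, text) "
                    "VALUES (%s, %s, %s, %s)")
    cursor.execute(insert_query, (tweet_id, screen_name, created_at, text))
    db.commit()
    cursor.close()
    db.close()
    return

```

Inside the listener's on_data method, you'd call store_data(created_at, text, screen_name, tweet_id) in place of the MongoDB insert.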

That’s it.  You are now collecting tweets and storing those tweets into a MySQL database.

Full Script



Want more information on working with MySQL and Python? Check out the book MySQL for Python by Albert Lukaszewski.


Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and other various types of analysis.

In order to do this type of analysis, you first need to have some data to analyze. You can scrape websites like Twitter or Facebook using simple web scrapers, but I've always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it's been a while since I've looked at the Search API). I'm a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that 'listens' to the API for your required keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect up to the Streaming API. There are also quite a few Twitter API wrappers for Python (and most of them work very well).   I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB simply because MySQL fits into my ecosystem of backup scripts, etc better than MongoDB does.  MongoDB is perfectly suited for this type of work for larger projects…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces/parts of this script but I don’t want to discount the help I had from other people / sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:
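```python
# this assumes the pre-4.0 tweepy API, which is what was current when this was written
import json

import tweepy
from tweepy import OAuthHandler, Stream
from pymongo import MongoClient
```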

Next, set up your mongoDB path:
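```python
# assumes MongoDB is running locally; the database name is my own choice
MONGO_HOST = 'mongodb://localhost/twitterdb'
```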

Next, set up the words that you want to 'listen' for on Twitter. You can use words or phrases separated by commas.
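```python
# the specific terms below are just examples - swap in whatever you want to track
WORDS = ['#machinelearning', '#datascience', '#bigdata', '#ai', '#analytics']
```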

Here, I'm listening for words related to machine learning, data science, etc.

Next, let’s set up our Twitter API Access information.  You can set these up here.
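```python
# fill these in with the keys / tokens from your own Twitter app
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"
```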

Time to build the listener class.
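A minimal version might look like this (the database and collection names are my own choices; on_data is where each incoming tweet gets handled):

```python
class StreamListener(tweepy.StreamListener):
    # a listener that writes each incoming tweet straight into MongoDB

    def on_connect(self):
        print("Connected to the Twitter streaming API.")

    def on_error(self, status_code):
        print("Streaming API error: " + repr(status_code))
        return False  # returning False disconnects the stream

    def on_data(self, data):
        try:
            client = MongoClient(MONGO_HOST)
            db = client.twitterdb               # database name is illustrative
            datajson = json.loads(data)         # decode the raw tweet into a dict
            created_at = datajson['created_at']
            print("Tweet collected at " + str(created_at))
            db.twitter_search.insert(datajson)  # collection name is illustrative
        except Exception as e:
            print(e)
```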

Now that we have the listener class, let’s set everything up to start listening.
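Something like this (a sketch):

```python
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

listener = StreamListener()
streamer = Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)  # blocks and runs until interrupted
```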

Now you are ready to go. The full script is below. You can store this script as "streaming_API.py" and run it as "python streaming_API.py" and, assuming you set up MongoDB and your Twitter API keys correctly, you should start collecting Tweets.

The Full Script: