Tag: time series

Forecasting Time Series data with Prophet – Jupyter Notebook

In previous posts, I described how I use Prophet forecasting time series data.  There were some questions in the comments about the code not working, so I wanted to publish a new post with a link to a Jupyter Notebook that will hopefully provide a full, correct working example.

The original posts are:

The Jupyter notebook can be found here:


Forecasting Time-Series data with Prophet – Part 2

Note: There’s been some questions (and some issues with my original code). I’ve uploaded a jupyter notebook with corrected code for Part 1 and Part 2.  The notebook can be found here.

In Forecasting Time-Series data with Prophet – Part 1, I introduced Facebook’s Prophet library for time-series forecasting.   In this article, I wanted to take some time to share how I work with the data after the forecasts. Specifically, I wanted to share some tips on how I visualize the Prophet forecasts using matplotlib rather than relying on the default prophet charts (which I’m not a fan of).

Just like part 1, I’m going to be using this retail sales example csv file find on github.

For this work, we’ll need to import matplotlib and set up some basic parameters to be format our plots in a nice way (unlike the hideous default matplotlib format).

from fbprophet import Prophet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline #only needed for jupyter

With this chunk of code, we import fbprophet, numpy, pandas and matplotlib. Additionally, since I’m working in jupyter notebook, I want to add the “`%matplotlib inline“` instruction to view the charts that are created during the session. Lastly, I set my figuresize and sytle to use the ‘ggplot’ style.

Since I’ve already described the analysis phase with Prophet, I’m not going to provide commentary on it here. You can jump back to Part 1 for a walk-through.

sales_df = pd.read_csv('examples/retail_sales.csv')
sales_df['y_orig']=sales_df.y # We want to save the original data for later use
sales_df['y'] = np.log(sales_df['y']) #take the log of the data to remove trends, etc
model = Prophet()
#create 12 months of future data
future_data = model.make_future_dataframe(periods=12, freq = 'm')
#forecast the data for future data
forecast_data = model.predict(future_data)

At this point, your data should look like this:

sample output of sales forecast


Now, let’s plot the output using Prophet’s built-in plotting capabilities.




Plot from fbprophet

While this is a nice chart, it is kind of ‘busy’ for me.  Additionally, I like to view my forecasts with original data first and forecasts appended to the end (this ‘might’ make sense in a minute).

First, we need to get our data combined and indexed appropriately to start plotting. We are only interested (at least for the purposes of this article) in the ‘yhat’, ‘yhat_lower’ and ‘yhat_upper’ columns from the Prophet forecasted dataset.  Note: There are much more pythonic ways to these steps, but I’m breaking them out for each of understanding.

sales_df.set_index('ds', inplace=True)
forecast_data.set_index('ds', inplace=True)
viz_df = sales_df.join(forecast_data[['yhat', 'yhat_lower','yhat_upper']], how = 'outer')
del viz_df['y']
del viz_df['index']

You don’t need to delete the ‘y’and ‘index’ columns, but it makes for a cleaner dataframe.

If you ‘tail’ your dataframe, your data should look something like this:

final dataframe for visualization

You’ll notice that the ‘y_orig’ column is full of “NaN” here. This is due to the fact that there is no original data for the ‘future date’ rows.

Now, let’s take a look at how to visualize this data a bit better than the Prophet library does by default.

First, we need to get the last date in the original sales data. This will be used to split the data for plotting.

sales_df.index = pd.to_datetime(sales_df.index)
last_date = sales_df.index[-1]

To plot our forecasted data, we’ll set up a function (for re-usability of course). This function imports a couple of extra libraries for subtracting dates (timedelta) and then sets up the function.

from datetime import date,timedelta
def plot_data(func_df, end_date):
    end_date = end_date - timedelta(weeks=4) # find the 2nd to last row in the data. We don't take the last row because we want the charted lines to connect
    mask = (func_df.index > end_date) # set up a mask to pull out the predicted rows of data.
    predict_df = func_df.loc[mask] # using the mask, we create a new dataframe with just the predicted data.
# Now...plot everything
    fig, ax1 = plt.subplots()
    ax1.plot((np.exp(predict_df.yhat)), color='black', linestyle=':')
    ax1.fill_between(predict_df.index, np.exp(predict_df['yhat_upper']), np.exp(predict_df['yhat_lower']), alpha=0.5, color='darkgray')
    ax1.set_title('Sales (Orange) vs Sales Forecast (Black)')
    ax1.set_ylabel('Dollar Sales')
# change the legend text
    L=ax1.legend() #get the legend
    L.get_texts()[0].set_text('Actual Sales') #change the legend text for 1st plot
    L.get_texts()[1].set_text('Forecasted Sales') #change the legend text for 2nd plot

This function does a few simple things. It finds the 2nd to last row of original data and then creates a new set of data (predict_df) with only the ‘future data’ included. It then creates a plot with confidence bands along the predicted data.

The ploit should look something like this:

Actual Sales vs Forecasted Sales

Hopefully you’ve found some useful information here. Check back soon for Part 3 of my Forecasting Time-Series data with Prophet.

Visualizing data – overlaying charts in python

Visualizing data is vital to analyzing data.  If you can’t see your data – and see it in multiple ways – you’ll have a hard time analyzing that data.  There are quite a few ways to visualize data and, thankfully, with pandas, matplotlib and/or seaborn, you can make some pretty powerful visualizations during analysis.

One of the things I like to do when I get a new dataset is try to visualize data points against each other to see if there’s anything that jumps out at me.   To do this, I like to overlay charts against each other to find any patterns in the data / charts. With matplotlib, this is pretty easy to do but working with dual-axis can be a bit confusing at first.

Want  to learn more about data visualization and/or matplotlib? Here are a few books / websites with good info on the topic.

One chart that I like to look at for data that I know has a relationship – like sales revenue and number of widgets sold – is the dual overlay of revenue vs quantity.  An example of one of my go-to approaches for visualizing data is in Figure 1 below.

Visualizing data - revenue vs number of items
Figure 1: Visualizing data — Revenue vs Quantity chart overlay

In this chart, we have Monthly Sales Revenue (blue line) chart overlay-ed against the Number of Items Sold chart (multi-colored bar chart). This type of chart lets me quickly see if there are any easy patterns in the revenue vs # of items.

I’ve not found a quick/easy way to build the multi-colored bar chart without hacking the data and building each colored section manually…so if you know a better way that what I share below, let me know.

An example

Here’s my code for building this chart using this data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline # needed for jupyter notebooks
plt.rcParams['figure.figsize']=(20,10) # set the figure size
plt.style.use('fivethirtyeight') # using the fivethirtyeight matplotlib theme
sales = pd.read_csv('examples/sales.csv') # Read the data in
sales.Date = pd.to_datetime(sales.Date) #set the date column to datetime
sales.set_index('Date', inplace=True) #set the index to the date column
# now the hack for the multi-colored bar chart: 
# create fiscal year dataframes covering the timeframes you are looking for. In this case,
# the fiscal year covered October - September.
# --------------------------------------------------------------------------------
# Note: This should be set up as a function, but for this small amount of data,
# I just manually built each fiscal year. This is not very pythonic and would
# suck to do if you have many years of data, but it isn't bad for a few years of data. 
# --------------------------------------------------------------------------------
fy10_all = sales[(sales.index >= '2009-10-01') & (sales.index < '2010-10-01')]
fy11_all = sales[(sales.index >= '2010-10-01') & (sales.index < '2011-10-01')]
fy12_all = sales[(sales.index >= '2011-10-01') & (sales.index < '2012-10-01')]
fy13_all = sales[(sales.index >= '2012-10-01') & (sales.index < '2013-10-01')]
fy14_all = sales[(sales.index >= '2013-10-01') & (sales.index < '2014-10-01')]
fy15_all = sales[(sales.index >= '2014-10-01') & (sales.index < '2015-10-01')]
# Let's build our plot
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # set up the 2nd axis
ax1.plot(sales.Sales_Dollars) #plot the Revenue on axis #1
# the next few lines plot the fiscal year data as bar plots and changes the color for each.
ax2.bar(fy10_all.index, fy10_all.Quantity,width=20, alpha=0.2, color='orange')
ax2.bar(fy11_all.index, fy11_all.Quantity,width=20, alpha=0.2, color='gray')
ax2.bar(fy12_all.index, fy12_all.Quantity,width=20, alpha=0.2, color='orange')
ax2.bar(fy13_all.index, fy13_all.Quantity,width=20, alpha=0.2, color='gray')
ax2.bar(fy14_all.index, fy14_all.Quantity,width=20, alpha=0.2, color='orange')
ax2.bar(fy15_all.index, fy15_all.Quantity,width=20, alpha=0.2, color='gray')
ax2.grid(b=False) # turn off grid #2
ax1.set_title('Monthly Sales Revenue vs Number of Items Sold Per Month')
ax1.set_ylabel('Monthly Sales Revenue')
ax2.set_ylabel('Number of Items Sold')
# Set the x-axis labels to be more meaningful than just some random dates.
labels = ['FY 2010', 'FY 2011','FY 2012', 'FY 2013','FY 2014', 'FY 2015']

This is just one way of visualizing data with python. Hopefully its a good example of a different approach that you may not have thought about.