Tag: pandas

Market Basket Analysis with Python and Pandas

Market Basket Analysis with Python and Pandas

If you’ve ever worked with retail data, you’ll most likely have run across the need to perform some market basket analysis (also called Cross-Sell recommendations).  If you aren’t sure what market basket analysis is, I’ve provided a quick overview below.

What is Market Basket Analysis?

In the simplest of terms, market basket analysis looks at retail sales data and determines what products are purchased together. For example, if you sell widgets and want to be able to recommend similar products and/or products that are purchased together, you can perform this type of analysis to be able to understand what products should be recommended when a user views a widget.

You can think of this type of analysis as generating the following ‘rules’:

  • If widget A, then recommend widget B, C and F
  • If widget L, then recommend widget X, Y and R

With these rules, you can then build our recommendation engines for your website, store and salespeople to use when selling products to customers. Market Basket Analysis requires a large amount of transaction data to work well. If you have a large amount of transactional data, you should be able to run a market basket analysis with ease. if you want to learn more about Market Basket Analysis, here’s some additional reading.

In the remainder of this article, I show you how to do this type of analysis using python and pandas.

Market Basket Analysis with Python and Pandas

There are a few approaches that you can take for this type of analysis.  You can use a pre-built library like MLxtend or you can build your own algorithm. I prefer the MLxtend library myself, but recently there’s been some memory issues using pandas and large datasets with MLxtend, so there have been times that I’ve needed to roll my own.

Below, I provide an example of using MLxtend as well as an example of how to roll your own analysis.

Market Basket Analysis with MLxtend

For this example, we’ll use the data set found here. This data-set contains enough data to be useful in understanding market basket analysis but isn’t too large that we can’t use MLxtend (because we can’t unstack the data, which is required to use MLxtend ).

To get started, you’ll need to have pandas and MLxtend installed:

Then, import your libraries:

Now, lets read in the data and then drop any rows that don’t have an invoice number. Lastly, we’ll convert the InvoiceNo column to a string. NOTE: I downloaded the data file from here and stored it in a subdirectory named data.

In this data, there are some invoices that are ‘credits’ instead of ‘debits’ so we want to remove those. They are indentified with “C” in the InvoiceNo field. We can see an example of these types of invoices with the following:

To remove these credit invoices, we can find all invoices with ‘C’ in them, and take the inverse of the results. That can be accomplished with the following line of code:

Now, we are ready to start our market basket analysis. First, we’ll groupby the columns that we want to consider. For the purposes of this analysis, we’ll only look at the United Kingdom orders.

Next, we want to hot encode the data and get 1 transaction per row to prepare to run our mlxtend analysis.

Let’s take a look at the output:

market basket analysis example

Looks like a bunch of zeros. What good is that? Well…its exactly what we want to see. We’ve encoded our data to show when a product is sold with another product. If there is a zero, that means those products haven’t sold together. Before we continue, we want to convert all of our numbers to either a 1 or a 0 (negative numbers are converted to zero, positive numbers are converted to 1). We can do this encoding step with the following function:

And now, we do our final encoding step:

Now, lets find out which items are frequently purchased together. We do this by applying the mlxtend apriori fuuinction to our dataset.

There one thing we need to think about first. the apriori function requires us to provide a minimum level of ‘support’. Support is defined as the percentage of time that an itemset appears in the dataset. If you set support = 50%, you’ll only get itemsets that appear 50% of the time. I like to set support to around 5% when starting out to be able to see some data/results and then adjust from there. Setting the support level to high could lead to very few (or no) results and setting it too low could require an enormous amount of memory to process the data.

In the case of this data, I originally set the min_support to 0.05 but didn’t receive any results, so I changed it to 0.03.

The final step is to build your association rules using the mxltend association_rules function. You can set the metric that you are most interested in (either lift or confidence and set the minimum threshold for the condfidence level (called min_threshold). The min_threshold can be thought of as the level of confidence percentage that you want to return. For example, if you set min_threshold to 1, you will only see rules with 100% confidence. I usually set this to 0.7 to start with.

With this, we generate 16 rules for our market basket analysis.

MLxtend rules for market basket analysis

This gives us a good number of data points to look at for this analysis. Now, what does this tell us?

If you look in the antecedents column and the consequents column, you’ll see names of products. Each rule tells us that the antecedents is sold along with the consequents. You can use this information to build a cross-sell recommendation system that promotes these products with each other on your website (or in person when doing in-person sales).

Without knowing much more about the business that generated this data, we can’t really do much more with it. If you were using your own data, you’d be able to dig a bit deeper to find those rules with higher confidence and/or lift to help you understand the items that are sold together most often and start building strategies to promote those items (or other items if you are trying to grow sales in other areas of your business).

When can you not use MLxtend?

MLxtend can be used anytime you want and it is my preferred approach for market basket analysis. That said, there’s an issue (as of the date of this article) with using pandas with large datasets when performing the step of unstacking the data with this line:

You can see the issue here.

When you run across this issue, you’ll need to find an approach to running a market basket analysis. You can probably find ways to work around the pandas unstack problem, but what I’ve done recently is just roll my own analysis (its actually pretty simple to do). That’s what I’ll show you below.

To get started, we need to import a few more libraries:

Let’s use our original dataframe and assign it to a new df so we know we are working with a completely new data-set vs the above. We’ll use the same United Kingdom filter that we did before

Now, lets grab just the order data. For this,we’ll get the InvoiceNo and StockCode columns since all we care about is whether an item exists on an invoice. Remember, we’ve already removed the ‘credit’ invoices in the above steps so all we have are regular invoices. NOTE: There *will* be differences in the output of this approach vs MLxtend’s approach just like there will be differences in other approaches you might use for market basket analysis.

Now that we have a pandas series of Items, Let’s calculate the item frequency and support values.

Let’s filter out any rows of data that doesn’t have support above our min_support level

We next need to filter out orders that only had 1 items ordered on the invoice, since those items won’t provide any insight into our market basket analysis.

Now, let’s calculate our stats dataframe again with this new order data-set.

Time to do the fun stuff. Calculating the itemsets / item pairs. We’ll create a function that will generate our itemsets and then send our new order dataset through the generator. Then, we calculate the frequency of each item with each other (named frequencyAC) as well as the support (named supportAC). Finally, we filter out the itemsets that are below our min_support level

Finally, we can calculate our association rules. First, let’s unstack our itemsets and create the necessary data columns for support, lift, etc.

Finally, let’s look at our final rules. We want to look at only those items that have confidence > 0.5.

Looking at the rules_over_50 data, we see our final set of rules using our ‘roll your own’ approach.

final rules for market basket

These rules are going to be a bit different than what we get with MLxtend, but that’s OK as it gives us another set of data to look at – and the only set of data to look at when your data is too large to use MLxtend. One extension to this approach would be to add in a step to replace the stockcode numbers with the item descriptions.  I’ll leave it to you to do that work.

Quick Tip – Speed up Pandas using Modin

I ran across a neat little library called Modin recently that claims to run pandas faster. The one line sentence that they use to describe the project is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas and thats it…no other changes required to your existing code.

One caveat – modin currently uses pandas 0.20.3 (at least it installs pandas 0.20. when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to wait on checking out modin – or play around with trying to get it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

To use modin:

That’s it.  Rather than import pandas as pd you import modin.pandas as pd and you get all the advantages of additional speed.

read_csv_benchmark from Modin
A Read CSV Benchmark provided by Modin

According to the documentation, modin takes advantage of multi-cores on modern machines, which pandas does not do. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.

Let’s give is a shot and see how it works.

For this test, I’m going to try out their read_csv method since its something they highlight. For this test, I have a 105 MB csv file. Lets time both pandas and modin and see how things work.

We’ll start with pandas.

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.

Now, lets take a look at modin.

Before continuing, I should share that I had to do a couple extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

Using the exact same code as above (except one minor change to import modin — import modin.pandas as pd.

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.

Using modin – in this example – I was able to shave off 0.3 seconds from the average read time for reading in that 105MB csv file. That may not seem like a lot of time, but it is a savings of around 27%. Just imagine if you’ve got 5000 csv files to read in that are of similar size, that’s a savings of 1500 seconds on average…that’s 25 minutes of time saved in just reading files.

Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.

I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies.  Take a look at it and let me know what you think.

Quick Tip: Comparing two pandas dataframes and getting the differences

There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e.,g Comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently its still a popular ‘question’ for many people and is my most popular question on stackoverflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

dataframe 2 (named df2):

The answer, it seems, is quite simple – but I couldn’t figure it out at the time.  Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on stackoverlow):

This simple approach leads to the correct answer:

There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is  fast enough for my needs.


Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.