
Market Basket Analysis with Python and Pandas

If you’ve ever worked with retail data, you’ve most likely run across the need to perform some market basket analysis (also called cross-sell recommendations). If you aren’t sure what market basket analysis is, I’ve provided a quick overview below.

What is Market Basket Analysis?

In the simplest of terms, market basket analysis looks at retail sales data and determines what products are purchased together. For example, if you sell widgets and want to be able to recommend similar products and/or products that are purchased together, you can perform this type of analysis to be able to understand what products should be recommended when a user views a widget.

You can think of this type of analysis as generating the following ‘rules’:

  • If widget A, then recommend widget B, C and F
  • If widget L, then recommend widget X, Y and R

With these rules, you can then build recommendation engines for your website, store, and salespeople to use when selling products to customers. Market basket analysis requires a large amount of transaction data to work well, but if you have that much transactional data, you should be able to run the analysis with ease. If you want to learn more about market basket analysis, here’s some additional reading.
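To make the idea concrete, here’s a minimal sketch of how rules like these could drive a recommendation lookup. The widget names and the `recommendations` dictionary are purely hypothetical placeholders for whatever your analysis produces.

# Hypothetical rules: product viewed -> products to recommend
recommendations = {
    'widget A': ['widget B', 'widget C', 'widget F'],
    'widget L': ['widget X', 'widget Y', 'widget R'],
}

def recommend(product):
    # Return the cross-sell list for the product being viewed (empty if no rule exists)
    return recommendations.get(product, [])

print(recommend('widget A'))  # ['widget B', 'widget C', 'widget F']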

In the remainder of this article, I show you how to do this type of analysis using python and pandas.

Market Basket Analysis with Python and Pandas

There are a few approaches that you can take for this type of analysis. You can use a pre-built library like MLxtend or you can build your own algorithm. I prefer the MLxtend library myself, but recently there have been some memory issues when using pandas and large datasets with MLxtend, so there have been times that I’ve needed to roll my own.

Below, I provide an example of using MLxtend as well as an example of how to roll your own analysis.

Market Basket Analysis with MLxtend

For this example, we’ll use the data set found here. This dataset contains enough data to be useful for understanding market basket analysis, but isn’t so large that we can’t use MLxtend (with a much larger dataset, we wouldn’t be able to unstack the data, which is required to use MLxtend).

To get started, you’ll need to have pandas and MLxtend installed:

pip install pandas mlxtend

Then, import your libraries:

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Now, let’s read in the data, drop any rows that don’t have an invoice number, and convert the `InvoiceNo` column to a string. NOTE: I downloaded the data file from here and stored it in a subdirectory named `data`.

df = pd.read_excel('data/Online Retail.xlsx')
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

In this data, there are some invoices that are ‘credits’ instead of ‘debits’, so we want to remove those. They are identified with a “C” in the `InvoiceNo` field. We can see an example of these types of invoices with the following:

df[df.InvoiceNo.str.contains('C', na=False)].head()

To remove these credit invoices, we can find all invoices with ‘C’ in them, and take the inverse of the results. That can be accomplished with the following line of code:

df = df[~df['InvoiceNo'].str.contains('C')]

Now, we are ready to start our market basket analysis. First, we’ll group the data by the columns that we want to consider. For the purposes of this analysis, we’ll only look at the United Kingdom orders.

market_basket = df[df['Country'] =="United Kingdom"].groupby(
                ['InvoiceNo', 'Description'])['Quantity']

Next, we want to one-hot encode the data and get one transaction per row to prepare for running our MLxtend analysis.

market_basket = market_basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

Let’s take a look at the output:

market_basket.head()

market basket analysis example

Looks like a bunch of zeros. What good is that? Well…it’s exactly what we want to see. We’ve encoded the data so that each row is an invoice and each column is a product, which lets us see which products are sold together. A zero means that product wasn’t purchased on that invoice. Before we continue, we want to convert all of our numbers to either a `1` or a `0` (quantities of zero or less become 0, and quantities of one or more become 1). We can do this encoding step with the following function:

def encode_data(datapoint):
    # quantities of one or more become 1; everything else becomes 0
    if datapoint >= 1:
        return 1
    return 0

And now, we do our final encoding step:

market_basket = market_basket.applymap(encode_data)

Now, let’s find out which items are frequently purchased together. We do this by applying the MLxtend `apriori` function to our dataset.

There’s one thing we need to think about first: the `apriori` function requires us to provide a minimum level of ‘support’. Support is defined as the percentage of transactions in which an itemset appears. If you set support to 50%, you’ll only get itemsets that appear in 50% of the transactions. I like to set support to around 5% when starting out, so I can see some data/results, and then adjust from there. Setting the support level too high could lead to very few (or no) results, and setting it too low could require an enormous amount of memory to process the data.
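If you want to sanity-check what a given support level means for your data, you can compute support directly from the encoded matrix before calling `apriori`. The column names below are placeholders for whatever products exist in your `market_basket` dataframe.

# Support of a single item: the fraction of invoices that contain it.
# 'ITEM A' and 'ITEM B' are placeholder column names.
item_support = market_basket['ITEM A'].mean()

# Support of a pair of items: the fraction of invoices that contain both
# (the columns are 0/1, so multiplying them acts as a logical AND).
pair_support = (market_basket['ITEM A'] * market_basket['ITEM B']).mean()

print(item_support, pair_support)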

In the case of this data, I originally set the `min_support` to 0.05 but didn’t receive any results, so I changed it to 0.03.

itemsets = apriori(market_basket, min_support=0.03, use_colnames=True)

The final step is to build your association rules using the MLxtend `association_rules` function. You can set the metric that you are most interested in (either `lift` or `confidence`) and set the minimum threshold for that metric (called `min_threshold`). When using confidence, `min_threshold` can be thought of as the minimum confidence level that you want a rule to have. For example, if you set `min_threshold` to 1, you will only see rules with 100% confidence. I usually set this to 0.7 to start with.

rules = association_rules(itemsets, metric="lift", min_threshold=0.5)

With this, we generate 16 rules for our market basket analysis.

MLxtend rules for market basket analysis

This gives us a good number of data points to look at for this analysis. Now, what does this tell us?

If you look in the `antecedents` column and the `consequents` column, you’ll see names of products. Each rule tells us that the antecedent is sold along with the consequent. You can use this information to build a cross-sell recommendation system that promotes these products with each other on your website (or in person when doing in-person sales).

Without knowing much more about the business that generated this data, we can’t really do much more with it. If you were using your own data, you’d be able to dig a bit deeper to find those rules with higher confidence and/or lift to help you understand the items that are sold together most often and start building strategies to promote those items (or other items if you are trying to grow sales in other areas of your business).
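As a starting point for that kind of digging, you can filter and sort the dataframe that `association_rules` returns; `confidence` and `lift` are standard columns in its output, and the thresholds below are just examples to adjust for your own data.

# Keep only the stronger rules and look at the best ones first.
strong_rules = rules[(rules['confidence'] >= 0.7) & (rules['lift'] > 1)]
strong_rules = strong_rules.sort_values('lift', ascending=False)

print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())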

When can you not use MLxtend?

MLxtend can be used anytime you want and it is my preferred approach for market basket analysis. That said, there’s an issue (as of the date of this article) with using pandas with large datasets when performing the step of unstacking the data with this line:

market_basket = market_basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

You can see the issue here.

When you run across this issue, you’ll need to find another approach to running a market basket analysis. You can probably find ways to work around the pandas unstack problem, but what I’ve done recently is just roll my own analysis (it’s actually pretty simple to do). That’s what I’ll show you below.

To get started, we need to import a few more libraries:

from itertools import combinations, groupby
from collections import Counter

Let’s take our original dataframe and assign it to a new one so we know we are working with a completely new dataset versus the one above. We’ll use the same United Kingdom filter that we did before.

df_manual = df[df['Country'] =="United Kingdom"]

Now, let’s grab just the order data. For this, we’ll get the `InvoiceNo` and `StockCode` columns, since all we care about is whether an item exists on an invoice. Remember, we already removed the ‘credit’ invoices in the steps above, so all we have are regular invoices. NOTE: There *will* be differences in the output of this approach vs MLxtend’s approach, just like there will be differences between other approaches you might use for market basket analysis.

orders = df_manual.set_index('InvoiceNo')['StockCode']

Now that we have a pandas series of items, let’s calculate the item frequency and support values.

statistics = orders.value_counts().to_frame("frequency")
statistics['support'] = statistics['frequency'] / len(set(orders.index)) * 100

Let’s filter out any rows of data that don’t have support above our `min_support` level. Note that support here is expressed as a percentage (0 to 100), so a `min_support` of 0.03 means 0.03% of orders, a much lower bar than the 3% we passed to MLxtend as 0.03.

min_support=0.03 # support here is a percentage, so this is 0.03% of orders
items_above_support = statistics[statistics['support'] >= min_support].index
orders_above_support = orders[orders.isin(items_above_support)]

We next need to filter out orders that only had one item on the invoice, since those invoices won’t provide any insight into our market basket analysis. We’ll apply this filter to the item-filtered orders from the previous step.

order_counts = orders_above_support.index.value_counts()
orders_over_two_index = order_counts[order_counts >= 2].index
orders_over_two = orders_above_support[orders_above_support.index.isin(orders_over_two_index)]

Now, let’s calculate our stats dataframe again with this new order data-set.

statistics = orders_over_two.value_counts().to_frame("frequency")
statistics['support'] = statistics['frequency'] / len(set(orders_over_two.index)) * 100

Time to do the fun stuff: calculating the itemsets / item pairs. We’ll create a function that generates our itemsets and then send our new order dataset through the generator. Then, we calculate the frequency of each item pair (named `frequencyAC`) as well as its support (named `supportAC`). Finally, we filter out the itemsets that fall below our `min_support` level.

def itemset_generator(orders):
    # yield every pair of items that appears together on the same invoice
    orders = orders.reset_index().values
    for order_id, order_object in groupby(orders, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
        for item_pair in combinations(item_list, 2):
            yield item_pair

itemsets_gen = itemset_generator(orders_over_two)
itemsets = pd.Series(Counter(itemsets_gen)).to_frame("frequencyAC")
itemsets['supportAC'] = itemsets['frequencyAC'] / len(orders_over_two_index) * 100
itemsets = itemsets[itemsets['supportAC'] >= min_support]

Finally, we can calculate our association rules. First, let’s flatten the itemset index into antecedent and consequent columns and create the necessary data columns for support, lift, etc.

# Create table of association rules and compute relevant metrics
itemsets = itemsets.reset_index().rename(columns={'level_0': 'antecedents', 'level_1': 'consequents'})
itemsets = (itemsets
     .merge(statistics.rename(columns={'frequency': 'freqA', 'support': 'antecedent support'}), left_on='antecedents', right_index=True)
     .merge(statistics.rename(columns={'frequency': 'freqC', 'support': 'consequents support'}), left_on='consequents', right_index=True))

itemsets['confidenceAtoC'] = itemsets['supportAC'] / itemsets['antecedent support']
itemsets['confidenceCtoA'] = itemsets['supportAC'] / itemsets['consequents support']
# the support columns are percentages, so multiply by 100 to put lift back on its usual scale
itemsets['lift'] = itemsets['supportAC'] / (itemsets['antecedent support'] * itemsets['consequents support']) * 100
itemsets = itemsets[['antecedents', 'consequents', 'antecedent support', 'consequents support', 'confidenceAtoC', 'lift']]

Finally, let’s look at our final rules. We only want to look at those rules that have confidence > 0.5.

rules = itemsets
rules_over_50 = rules[rules.confidenceAtoC > 0.50].sort_values('lift', ascending=False)

Looking at the rules_over_50 data, we see our final set of rules using our ‘roll your own’ approach.

final rules for market basket

These rules are going to be a bit different than what we get with MLxtend, but that’s OK as it gives us another set of data to look at – and the only set of data to look at when your data is too large to use MLxtend. One extension to this approach would be to add a step that replaces the StockCode numbers with the item descriptions. I’ll leave most of that to you, but a sketch of the idea is below.
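One possible way to start that mapping, using the `StockCode` and `Description` columns that are already in the original dataframe (taking the first description seen for each stock code):

# Build a StockCode -> Description lookup from the original UK data.
code_to_desc = (df_manual.drop_duplicates('StockCode')
                         .set_index('StockCode')['Description'])

# Map the codes in the rules to human-readable descriptions.
rules_named = rules_over_50.copy()
rules_named['antecedents'] = rules_named['antecedents'].map(code_to_desc)
rules_named['consequents'] = rules_named['consequents'].map(code_to_desc)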

Quick Tip – Speed up Pandas using Modin

I ran across a neat little library called Modin recently that claims to speed pandas up. The one-line sentence that they use to describe the project is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas, and that’s it…no other changes are required to your existing code.

One caveat – modin currently uses pandas 0.20.3 (at least, it installs pandas 0.20.3 when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to wait on checking out modin – or play around with trying to get it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

pip install modin

To use modin:

import modin.pandas as pd

That’s it. Rather than `import pandas as pd`, you `import modin.pandas as pd` and you get all the advantages of the additional speed.
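Given the version caveat above, it’s worth confirming which pandas version actually ended up installed alongside modin; here’s a quick check:

import pandas

# Print the version pandas reports, so you know which API you're working with.
print(pandas.__version__)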

A Read CSV Benchmark provided by Modin

According to the documentation, modin takes advantage of multi-cores on modern machines, which pandas does not do. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.

Let’s give it a shot and see how it works.

For this test, I’m going to try out their read_csv method since it’s something they highlight. I have a 105 MB csv file, so let’s time both pandas and modin and see how things work.

We’ll start with pandas.

from functools import reduce
from timeit import default_timer as timer
import pandas as pd

# run 25 iterations of read_csv to get an average
time = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    time.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, time) / len(time))

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.

Now, lets take a look at modin.

Before continuing, I should share that I had to do a couple extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

pip install "modin[dask]"
pip install typing

Using the exact same code as above (except for one minor change to the import: import modin.pandas as pd):

from functools import reduce
from timeit import default_timer as timer
import modin.pandas as pd

# run 25 iterations of read_csv to get an average
time = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    time.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, time) / len(time))

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.

Using modin – in this example – I was able to shave 0.3 seconds off the average read time for that 105MB csv file. That may not seem like a lot of time, but it is a savings of roughly 24%. Just imagine if you’ve got 5000 csv files of similar size to read in; that’s a savings of 1500 seconds on average…that’s 25 minutes of time saved just reading files.

Modin uses Ray (or Dask) under the hood to speed pandas up, so there could be even more savings if you get in and play around with some of those engine settings.

I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies.  Take a look at it and let me know what you think.


Quick Tip: Consuming Google Search results to use for web scraping

While working on a project recently, I needed to grab some google search results for specific search phrases and then scrape the content from the page results.

For example, when searching for a Sony 16-35mm f2.8 GM lens on google, I wanted to grab some content (reviews, text, etc.) from the results. While this isn’t hard to build from scratch, I ran across a couple of libraries that make things so much easier.

The first is ‘Google Search‘ (install via pip install google). This library lets you consume google search results with just one line of code. An example is below; it imports the google search module, runs a search for Sony 16-35mm f2.8 GM lens, and prints out the URLs of the results.

from googlesearch import search

for url in search('Sony 16-35mm f2.8 GM lens', tld='com', stop=1):
    print(url)

For the above, I’m using the google.com domain for the search (tld='com') and have told it to stop after the first set of results.
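If you want more results collected into a list rather than printed one by one, the same `search` function takes a few extra parameters. The parameter names below (`num`, `stop`, `pause`) may differ between versions of the library, so double-check your install.

from googlesearch import search

# Collect the first ten result URLs into a list; pause between requests
# to avoid hammering google. Parameter names may vary by library version.
urls = list(search('Sony 16-35mm f2.8 GM lens', tld='com',
                   num=10, stop=10, pause=2.0))
print(urls)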

The output:

https://www.bhphotovideo.com/c/product/1338516-REG/sony_sel1635gm_fe_16_35mm_f_2_8_gm.html
https://www.amazon.com/Sony-SEL1635GM-16-35mm-2-8-22-Camera/dp/B071LHLS11
https://www.sony.com/electronics/camera-lenses/sel1635gm
https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx
https://www.adorama.com/iso1635gm.html

That’s pretty easy.

Now, we can use those URLs to scrape the websites that are returned.

To scrape these sites, you could build some fairly complex scraping systems…or, if you just need some basic content and aren’t going to be doing a LOT of scraping, you could use the ‘Newspaper‘ library. Of course, there are plenty of other libraries, but the newspaper library really simplifies things for those ‘quick and dirty’ projects. Note: This is best used in python3.

To get started, install newspaper with pip3 install newspaper3k (for python3).

Now, to scrape the urls returned from the google search, you can simply do the following:

from newspaper import Article

# url is one of the URLs returned from the google search above
article = Article(url)
article.download()
article.parse()

This will grab the url, download it and parse it so you can access the content.  Here’s an example of grabbing the url https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx.

from newspaper import Article
article = Article('https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx')
article.download()
article.parse()
print(article.text)

The output of print(article.text) is below (I’ve only included an excerpt for this example, but this will grab the entire text):

‘Those putting together the ultimate Sony E-mount lens kit are going to want this lens included. The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range in wide aperture with high quality. In this case, the term high quality applies both to the lens\’ physical attributes and to the image quality delivered by it.\n\nMany are first-attracted to the Alpha MILC (Mirrorless Interchangeable Lens Camera) system for Sony\’s high-performing full frame imaging sensors, but lenses are as important as cameras and Sony\’s lens lineup was initially viewed by many as deficient. Adapting Canon brand lenses for use on Sony cameras was prevalent. The introduction of Sony\’s flagship Grand Master line (the “GM” in the name) was very welcomed by Sony owners and this line is proving attractive to those considering a switch to the Sony camp. The 16-35mm f/2.8 GM is one more reason to stay entirely within the Sony brand.\n\nFocal Length Range\n\nWhen starting a kit, most will first select a general purpose lens (Sony system owners should seriously consider the Sony FE 24-70mm f/2.8 GM Lens) and one of the next-most-needed lenses is typically a wide-angle zoom. This 16-35mm range ideally covers that need.\n\nThe 107° angle of view provided by a 16mm focal length is ultra-wide and all of the narrower angles of view down to 63°, just modestly-wide, are included. To explore what this focal length range looks like, we head to RB Rickett\’s falls in Ricketts Glen State Park.\n\nOne of the most popular uses for this range is, as illustrated above, landscape photography.
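Putting the two libraries together is just a loop. This sketch assumes the same search phrase as before and simply skips any page that fails to download or parse.

from googlesearch import search
from newspaper import Article

# Scrape the text of each page returned by the search.
for url in search('Sony 16-35mm f2.8 GM lens', tld='com', stop=10):
    try:
        article = Article(url)
        article.download()
        article.parse()
        print(url, len(article.text))
    except Exception:
        # Some pages block scraping or fail to parse; skip them.
        continue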

Now, one of the really cool features of the newspaper library is that it has built-in natural language processing capabilities and can return keywords, summaries and other interesting tidbits. To get this to work, you must have the Natural Language Toolkit (NLTK) installed (install with pip install nltk) and have the punkt package downloaded from NLTK. Here’s an example using the previous url (and assuming you’ve already done the above steps).

import nltk

# Let's download punkt.
# If you've already downloaded punkt, you can skip this step.
nltk.download('punkt')

article.nlp()  # this runs the natural language processing
print(article.keywords)

The result:

['focal', '1635mm', 'review', 'gm', 'lens', 'sony', 'focus', 'aperture', 'f28', 'fe', 'lenses']

That’s quite nice (and easy!). Of course, if I were doing this as a serious NLP project, I’d write my own NLP functions, but for a quick look at the keywords of an article, this is a fast way to do it.
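The library can also produce a short extractive summary once `nlp()` has run, which is worth printing alongside the keywords:

# After article.nlp(), a short summary of the article is also available.
print(article.summary)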


If you want to learn more about Natural Language Processing using NLTK, the definitive book is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.


Photo by Émile Perron on Unsplash