Tag: python

Market Basket Analysis with Python and Pandas

Market Basket Analysis with Python and Pandas

If you’ve ever worked with retail data, you’ll most likely have run across the need to perform some market basket analysis (also called Cross-Sell recommendations).  If you aren’t sure what market basket analysis is, I’ve provided a quick overview below.

What is Market Basket Analysis?

In the simplest of terms, market basket analysis looks at retail sales data and determines what products are purchased together. For example, if you sell widgets and want to be able to recommend similar products and/or products that are purchased together, you can perform this type of analysis to be able to understand what products should be recommended when a user views a widget.

You can think of this type of analysis as generating the following ‘rules’:

  • If widget A, then recommend widget B, C and F
  • If widget L, then recommend widget X, Y and R

With these rules, you can then build our recommendation engines for your website, store and salespeople to use when selling products to customers. Market Basket Analysis requires a large amount of transaction data to work well. If you have a large amount of transactional data, you should be able to run a market basket analysis with ease. if you want to learn more about Market Basket Analysis, here’s some additional reading.

In the remainder of this article, I show you how to do this type of analysis using python and pandas.

Market Basket Analysis with Python and Pandas

There are a few approaches that you can take for this type of analysis.  You can use a pre-built library like MLxtend or you can build your own algorithm. I prefer the MLxtend library myself, but recently there’s been some memory issues using pandas and large datasets with MLxtend, so there have been times that I’ve needed to roll my own.

Below, I provide an example of using MLxtend as well as an example of how to roll your own analysis.

Market Basket Analysis with MLxtend

For this example, we’ll use the data set found here. This data-set contains enough data to be useful in understanding market basket analysis but isn’t too large that we can’t use MLxtend (because we can’t unstack the data, which is required to use MLxtend ).

To get started, you’ll need to have pandas and MLxtend installed:

pip install pandas mlxtend

Then, import your libraries:

import pandas as pd

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Now, lets read in the data and then drop any rows that don’t have an invoice number. Lastly, we’ll convert the `InvoiceNo` column to a string. NOTE: I downloaded the data file from here and stored it in a subdirectory named `data`.

df = pd.read_excel('data/Online Retail.xlsx')
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

In this data, there are some invoices that are ‘credits’ instead of ‘debits’ so we want to remove those. They are indentified with “C” in the InvoiceNo field. We can see an example of these types of invoices with the following:

df[df.InvoiceNo.str.contains('C', na=False)].head()

To remove these credit invoices, we can find all invoices with ‘C’ in them, and take the inverse of the results. That can be accomplished with the following line of code:

df = df[~df['InvoiceNo'].str.contains('C')]

Now, we are ready to start our market basket analysis. First, we’ll groupby the columns that we want to consider. For the purposes of this analysis, we’ll only look at the United Kingdom orders.

market_basket = df[df['Country'] =="United Kingdom"].groupby(
                ['InvoiceNo', 'Description'])['Quantity']

Next, we want to hot encode the data and get 1 transaction per row to prepare to run our mlxtend analysis.

market_basket = market_basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

Let’s take a look at the output:

market_basket.head()

market basket analysis example

Looks like a bunch of zeros. What good is that? Well…its exactly what we want to see. We’ve encoded our data to show when a product is sold with another product. If there is a zero, that means those products haven’t sold together. Before we continue, we want to convert all of our numbers to either a `1` or a `0` (negative numbers are converted to zero, positive numbers are converted to 1). We can do this encoding step with the following function:

def encode_data(datapoint):
    if datapoint <= 0:
        return 0
    if datapoint >= 1:
        return 1

And now, we do our final encoding step:

market_basket = market_basket.applymap(encode_data)

Now, lets find out which items are frequently purchased together. We do this by applying the mlxtend `apriori` fuuinction to our dataset.

There one thing we need to think about first. the `apriori` function requires us to provide a minimum level of ‘support’. Support is defined as the percentage of time that an itemset appears in the dataset. If you set support = 50%, you’ll only get itemsets that appear 50% of the time. I like to set support to around 5% when starting out to be able to see some data/results and then adjust from there. Setting the support level to high could lead to very few (or no) results and setting it too low could require an enormous amount of memory to process the data.

In the case of this data, I originally set the `min_support` to 0.05 but didn’t receive any results, so I changed it to 0.03.

itemsets = apriori(market_basket, min_support=0.03, use_colnames=True)

The final step is to build your association rules using the mxltend `association_rules` function. You can set the metric that you are most interested in (either `lift` or `confidence` and set the minimum threshold for the condfidence level (called `min_threshold`). The `min_threshold` can be thought of as the level of confidence percentage that you want to return. For example, if you set `min_threshold` to 1, you will only see rules with 100% confidence. I usually set this to 0.7 to start with.

rules = association_rules(itemsets, metric="lift", min_threshold=0.5)

With this, we generate 16 rules for our market basket analysis.

MLxtend rules for market basket analysis

This gives us a good number of data points to look at for this analysis. Now, what does this tell us?

If you look in the `antecedents` column and the `consequents` column, you’ll see names of products. Each rule tells us that the `antecedents` is sold along with the `consequents`. You can use this information to build a cross-sell recommendation system that promotes these products with each other on your website (or in person when doing in-person sales).

Without knowing much more about the business that generated this data, we can’t really do much more with it. If you were using your own data, you’d be able to dig a bit deeper to find those rules with higher confidence and/or lift to help you understand the items that are sold together most often and start building strategies to promote those items (or other items if you are trying to grow sales in other areas of your business).

When can you not use MLxtend?

MLxtend can be used anytime you want and it is my preferred approach for market basket analysis. That said, there’s an issue (as of the date of this article) with using pandas with large datasets when performing the step of unstacking the data with this line:

market_basket = market_basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')

You can see the issue here.

When you run across this issue, you’ll need to find an approach to running a market basket analysis. You can probably find ways to work around the pandas unstack problem, but what I’ve done recently is just roll my own analysis (its actually pretty simple to do). That’s what I’ll show you below.

To get started, we need to import a few more libraries:

from itertools import combinations, groupby
from collections import Counter

Let’s use our original dataframe and assign it to a new df so we know we are working with a completely new data-set vs the above. We’ll use the same United Kingdom filter that we did before

df_manual = df[df['Country'] =="United Kingdom"]

Now, lets grab just the order data. For this,we’ll get the InvoiceNo and StockCode columns since all we care about is whether an item exists on an invoice. Remember, we’ve already removed the ‘credit’ invoices in the above steps so all we have are regular invoices. NOTE: There *will* be differences in the output of this approach vs MLxtend’s approach just like there will be differences in other approaches you might use for market basket analysis.

orders = df_manual.set_index('InvoiceNo')['StockCode']

Now that we have a pandas series of Items, Let’s calculate the item frequency and support values.

statistics = orders.value_counts().to_frame("frequency")
statistics['support']  = statistics / len(set(orders.index)) * 100

Let’s filter out any rows of data that doesn’t have support above our min_support level

min_support=0.03 # same value we used above.

items_above_support = statistics[statistics['support'] >= min_support].index
orders_above_support = orders[orders.isin(items_above_support)]

We next need to filter out orders that only had 1 items ordered on the invoice, since those items won’t provide any insight into our market basket analysis.

order_counts = orders.index.value_counts()
orders_over_two_index = order_counts[order_counts>=2].index
orders_over_two = orders[orders.index.isin(orders_over_two_index)]

Now, let’s calculate our stats dataframe again with this new order data-set.

statistics = orders_over_two.value_counts().to_frame("frequency")
statistics['support']  = statistics / len(set(orders_over_two.index)) * 100

Time to do the fun stuff. Calculating the itemsets / item pairs. We’ll create a function that will generate our itemsets and then send our new order dataset through the generator. Then, we calculate the frequency of each item with each other (named `frequencyAC`) as well as the support (named `supportAC`). Finally, we filter out the itemsets that are below our `min_support` level

def itemset_generator(orders):
    orders = orders.reset_index().values
    for order_id, order_object in groupby(orders, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
        for item_pair in combinations(item_list, 2):
            yield item_pair

itemsets_gen = itemset_generator(orders_over_two)
itemsets  = pd.Series(Counter(itemsets_gen)).to_frame("frequencyAC")
itemsets['supportAC'] = itemsets['frequencyAC'] / len(orders_over_two_index) * 100
itemsets = itemsets[itemsets['supportAC'] >= min_support]

Finally, we can calculate our association rules. First, let’s unstack our itemsets and create the necessary data columns for support, lift, etc.

# Create table of association rules and compute relevant metrics
itemsets = itemsets.reset_index().rename(columns={'level_0': 'antecedents', 'level_1': 'consequents'})

itemsets = (itemsets
     .merge(statistics.rename(columns={'freq': 'freqA', 'support': 'antecedent support'}), left_on='antecedents', right_index=True)
     .merge(statistics.rename(columns={'freq': 'freqC', 'support': 'consequents support'}), left_on='consequents', right_index=True))


itemsets['confidenceAtoC'] = itemsets['supportAC'] / itemsets['antecedent support']
itemsets['confidenceCtoA'] = itemsets['supportAC'] / itemsets['consequents support']
itemsets['lift'] = itemsets['supportAC'] / (itemsets['antecedent support'] * itemsets['consequents support'])

itemsets=itemsets[['antecedents', 'consequents','antecedent support', 'consequents support', 'confidenceAtoC','lift']]

Finally, let’s look at our final rules. We want to look at only those items that have confidence > 0.5.

rules = itemsets
rules_over_50 = rules[(rules.confidenceAtoC >0.50)]
rules_over_50.set_index('antecedents',inplace=True)
rules_over_50.reset_index(inplace=True)
rules_over_50=rules_over_50.sort_values('lift', ascending=False)

Looking at the rules_over_50 data, we see our final set of rules using our ‘roll your own’ approach.

final rules for market basket

These rules are going to be a bit different than what we get with MLxtend, but that’s OK as it gives us another set of data to look at – and the only set of data to look at when your data is too large to use MLxtend. One extension to this approach would be to add in a step to replace the stockcode numbers with the item descriptions.  I’ll leave it to you to do that work.

data roundup

Python Data Weekly Roundup – Dec 18 2019

In this week’s Python Data Weekly Roundup:

The Last Matplotlib Tweaking Guide You’ll Ever Need

This is a very good  ‘how to’ for beginners to learn to tweak the Matplotlib visualization library.  This article explains how to tweak matplotlib charts including changing the size, removing borders, changing colors and widths of chart lines.  Each tweak includes python code to make the tweaks.

Arithmetic, Geometric, and Harmonic Means for Machine Learning

Did you know there are different types of averages (aka means)?  After reading this article, you’ll have an understanding of what the difference is between the arithmetic, geometric and harmonic means are, why you should use one over the other and how to calculate them using python code.

What is My Data Worth?

Should you be paid for all the personal data that you’ve made available online? If so, what is that data worth?  In this fantastic article, Ruoxi Jia describes how to value personal data and describes how to apply the Shapley Value in data valuation and in general machine learning usages (e.g., interpreting black-box model predictions). An example of the Shapley Value is below. The below graphic shows two images from the article:

(a) The Shapley value produced by our proposed exact approach and the baseline Monte-Carlo approximation algorithm for the KNN classifier constructed with 1000 randomly selected training points from MNIST. (b) Runtime comparison of the two approaches as the training size increases.

Shapely Value Graph

FastSpeech: New text-to-speech model improves on speed, accuracy, and controllability

In this article, Microsoft Senior Research Xu Tan describes a new text-to speech model called FastSpeech. This new model is claimed to be fast, robust, controllable and high quality (which are all valuable and necessary features).   A deep dive of this model can be found here.

How to Develop Super Learner Ensembles in Python

Another great article from Jason Brownlee describing how to combine multiple models into an ensemble model for use in predictive modeling.  Jason provides python code that you can use to build your own Super Learner with scikit-learn. Additionally – and more importantly – Jason does a fantastic job of highlight the theory behind Super Learners with many links to articles and journals on the topic.

Strengthening the AI community

An overview of the DeepMind scholarship program as well as a description of why it makes sense to help others move into the field of AI.

Text Generation with Python

Natural Language Processing is well known as a way to analyze text.  I’ve written a bit about using NLP here on the site (see here and here). In this article, Julien Heiduk describes how he was able to use the GPT-2 model to generate text with python. In fact, the article is almost completely generated text via the GPT-2 model..and it does a good job of creating readable and understandable content.

Best Degree for Data Science (in One Picture)

Is there a ‘best’ degree for data science? Personally, I don’t think there is….but I can see there being better degrees for people that are just starting out.   For example, all things being equal on the personal front, a degree in statistics is going to be much better for you than a degree in horticulture…but…that’s not to say the statistics degree makes you a better data scientist…it just gives you the tools to get into the field quicker than someone with the horticulture degree. That said, I do like what Stephanie Glen says in this article when she writes: “getting a degree should be looked at as a stepping block, not a train ride to a destination. No single degree is likely to get you in the door.”


That’s it for this week’s Python Data Weekly Roundup. Subscribe to our newsletter to receive this weekly roundup in your email.

Comparing Machine Learning Methods

When working with data and modeling, its sometimes hard to determine what model you should use for a particular modeling project.  A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models work against your data. In this post, I’ll be comparing machine learning methods using a few different sklearn algorithms.  As always, you can find a jupyter notebook for this article on my github here and find other articles on this topic here.

I’ve used Jason Brownlee’s article from 2016 as the basis for this article…I wanted to expand a bit on what he did as well as use a different dataset. In this article, we’ll be using the Indian Liver Disease dataset (found here).

From the dataset page:

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The “Dataset” column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.

Let’s get started by setting up our imports that we’ll use.

import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


Next, we’ll read in the data from the CSV file located in the local directory.

#read in the data
data = pd.read_csv('indian_liver_patient.csv')

If you do a head() of the dataframe, you’ll get a good feeling for the dataset.

Indian Liver Disease Data

We’ll use all columns except Gender for this tutorial. We could use gender by converting the gender to a numeric value (e.g., 0 for Male, 1 for Female) but for the purposes of this post, we’ll just skip this column.

data_to_use = data
del data_to_use['Gender']
data_to_use.dropna(inplace=True)

The ‘Dataset’ column is the value we are trying to predict…whether the user has liver disease or not so we’ll that as our “Y” and the other columns for our “X” array.

values = data_to_use.values

Y = values[:,9]
X = values[:,0:9]

Before we run our machine learning models, we need to set a random number to use to seed them. This can be any random number that you’d like it to be. Some people like to use a random number generator but for the purposes of this, I’ll just set it to 12 (it could just as easily be 1 or 3 or 1023 or any other number).

random_seed = 12

Now we need to set up our models that we’ll be testing out. We’ll set up a list of the models and give them each a name. Additionally, I’m going to set up the blank arrays/lists for the outcomes and the names of the models to use for comparison.

outcome = []
model_names = []
models = [('LogReg', LogisticRegression()), 
          ('SVM', SVC()), 
          ('DecTree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LinDisc', LinearDiscriminantAnalysis()),
          ('GaussianNB', GaussianNB())]

We are going to use a k-fold validation to evaluate each algorithm and will run through each model with a for loop, running the analysis and then storing the outcomes into the lists we created above. We’ll use a 10-fold cross validation.

for model_name, model in models:
    k_fold_validation = model_selection.KFold(n_splits=10, random_state=random_seed)
    results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)
    model_names.append(model_name)
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)

The output from this loop is:

LogReg| Mean=0.718633 STD=0.058744
SVM| Mean=0.715124 STD=0.058962
DecTree| Mean=0.637568 STD=0.108805
KNN| Mean=0.651301 STD=0.079872
LinDisc| Mean=0.716878 STD=0.050734
GaussianNB| Mean=0.554719 STD=0.081961

From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discrimination Analysis methods are providing the best results (based on the ‘mean’ values). Taking Jason’s lead, we can take a look at a box plot to see what the accuracy is for each cross validation fold, we can see just how good each does relative to each other and their means.

fig = plt.figure()
fig.suptitle('Machine Learning Model Comparison')
ax = fig.add_subplot(111)
plt.boxplot(outcome)
ax.set_xticklabels(model_names)
plt.show()

Machine Learning Comparison

From the box plot, when it is easy to see the three mentioned machine learning methods (Logistic Regression, Support Vector Machine and Linear Discrimination Analysis) are providing better accuracies. From this outcome, we can then take this data and start working with these three models to see how we might be able to optimize the modeling process to see if one model works a bit better than others.