Recommendations drive revenue for businesses around the world. From Amazon and Netflix to Etsy and Stitchfix, recommendation systems tap into transaction data to suggest items that are frequently purchased together.
A consumer may make a purchase and have a need that is heavily associated with that item, but just forget about it in the moment. Alternatively, they may not even know a product for their pain point exists.
Whether the goal is to intercept a forgetful buyer or bring awareness to an unaware buyer, recommendation systems lift sales.
At the end of the day, the goal of these recommendation systems is to increase revenue for companies like Amazon or to encourage retention-focused actions, like watching a movie the user will love on Netflix.
These recommendations turn the data these companies collect into an asset and a competitive advantage.
The goal of this blog post is to show you how to identify associated products in your product set to give your company an edge.
Market basket analysis is a data analysis technique that examines a set of transactions to discover which products are associated with one another.
By calculating several metrics on this dataset, we can identify whether a combination of products is bought together more often than we would expect by chance.
This analysis allows us to visualize recommendation trends and place associated products near each other, whether in a physical location like a grocery store or on a website.
Through market basket analysis, you will develop rules with antecedents and consequents.
An antecedent is the first item or group of items in a rule: the items the customer has already purchased.
A consequent is the item or group of items that is recommended based on the antecedent. For example, in the rule {milk} → {cereal}, milk is the antecedent and cereal is the consequent.
In our analysis, we will create thousands of these rules to identify the recommendations that are most accurate and bring the biggest lift to the business.
Conducting a market basket analysis involves quite a few moving pieces, so we need to import several libraries.
As always, we’re going to import pandas first. It is our super spreadsheet library. We will import several functions from mlxtend, a library that handles a lot of market basket analysis functions. We will import numpy so we can create binary combinations using AND logic.
Finally, we will import permutations from itertools so we can see how computationally expensive creating these rules can be and choose a better option.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import numpy as np
from itertools import permutations
As always, we need to prepare our transaction data before being able to manipulate it. I will be using a grocery dataset from Kaggle since it has plenty of data to use as an example for this post.
Consider using your own dataset to test this out for yourself, but feel free to use this dataset to follow along. We will first read in our dataset with the “read_csv” method.
# Read the raw transactions; each row is one basket of comma-separated items
groceries = pd.read_csv('groceries.csv', engine='python')
# Split each basket string into a Python list of item names
groceries['Transactions'] = groceries['Transactions'].apply(lambda t: t.split(','))
In order to properly evaluate the metrics we need for market basket analysis, we need to turn our transactions into a DataFrame of booleans indicating whether or not each item in the store is included in a given transaction.
This will make it easier to evaluate the metrics we need.
We will use the TransactionEncoder object from mlxtend to fit all of our groceries. We will then transform them using the encoder and create a DataFrame with a column for each product and a row for each transaction.
True indicates that the product is included in that transaction and False indicates that the product is not included in that transaction.
encoder = TransactionEncoder().fit(groceries['Transactions'].to_list())
groceries = encoder.transform(groceries['Transactions'].to_list())
groceries = pd.DataFrame(groceries, columns=encoder.columns_)
The support metric is the fraction of transactions in which an item appears. It is the number of transactions containing the grocery item divided by the total number of transactions.
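In the notation used for the other formulas in this post, that is:

Support(X) = (transactions containing X) / (total transactions)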
The support metric is incredibly important as it is the basis of every other metric that fuels our market basket analysis.
Because each column is boolean, calculating the mean of the entire transaction DataFrame gives us the support value for every item at once.
support = groceries.mean()
support = support.reset_index()
support.columns = ['Product', 'Support']
print(support)
The confidence value is the probability that a customer will purchase one product given that they have purchased another. It is calculated by dividing the frequency that the two products are sold together by the frequency that the first product is purchased.
Support(X&Y) / Support(X)
Confidence is between 0 and 1 since it is a probability. If the confidence is equal to the support of the second product (known as the “consequent”), then purchasing the first product does not change the likelihood that the customer will purchase the second.
This extra comparison makes confidence difficult to interpret on its own, so lift and conviction may be better metrics to use.
def confidence(antecedent, consequent):
    # Support of the antecedent on its own
    antecedent_support = antecedent.mean()
    # Support of the antecedent and consequent appearing in the same basket
    both_support = np.logical_and(antecedent, consequent).mean()
    confidence = both_support / antecedent_support
    return confidence
The lift value represents how much more often a rule occurs than we would expect by random chance. Lift is always non-negative: a value of 1 means the two products are independent, and values above 1 indicate a positive association.
A lift greater than 1 means that these two items are purchased together more often than expected based on their individual support values. This means that the association between these two products is not likely to be due to random chance.
We can calculate lift with the following equation:
Support(X&Y) / ( Support(X) * Support(Y) )
def lift(antecedent, consequent):
    antecedent_support = antecedent.mean()
    consequent_support = consequent.mean()
    both_support = np.logical_and(antecedent, consequent).mean()
    lift = both_support / (antecedent_support * consequent_support)
    return lift
We calculate leverage with the following formula:
Support(X&Y) – Support(X)*Support(Y)
Leverage ranges between -1 and 1. It has a tighter range than lift, so it is easier to interpret.
If leverage is greater than 0, it is equivalent to lift being greater than 1, so there is a positive association.
def leverage(antecedent, consequent):
antecedent_support = antecedent.mean()
consequent_support = consequent.mean()
both_support = np.logical_and(antecedent, consequent).mean()
leverage = both_support - (antecedent_support*consequent_support)
return leverage
The conviction value is a complicated metric that is calculated with the following equation:
( Support(X) * Support(Not Y) ) / Support(X & not Y )
The support of Not Y can be computed as 1 – Support(Y).
If conviction is greater than 1, there is evidence of a positive association between our products.
def conviction(antecedent, consequent):
    antecedent_support = antecedent.mean()
    consequent_support = consequent.mean()
    both_support = np.logical_and(antecedent, consequent).mean()
    not_consequent_support = 1 - consequent_support
    # Support(X & Not Y) = Support(X) - Support(X & Y)
    antecedent_support_not_consequent = antecedent_support - both_support
    conviction = (antecedent_support * not_consequent_support) / antecedent_support_not_consequent
    return conviction
We need to be careful with the number of items we allow in each rule.
Notice that at four items per antecedent, we get over 4 million permutations. This makes it almost impossible to evaluate all of those rules.
The question is, how much valuable information are we missing by not including four-to-seven-item antecedents?
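As a rough illustration (a sketch, not code from the original walkthrough), we can count the candidate rules directly; the exact numbers depend on how many distinct products your dataset contains.

from itertools import permutations
from math import perm

n_items = groceries.shape[1]  # number of distinct products after encoding

# Small rule sizes can still be enumerated directly with itertools...
two_item_rules = list(permutations(groceries.columns, 2))
print(f"2-item rules: {len(two_item_rules):,}")

# ...but for larger rule sizes it is cheaper to just count the permutations.
for size in range(3, 6):
    print(f"{size}-item rules: {perm(n_items, size):,}")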
Luckily, there is an algorithm to identify the helpful rules so we can still include these without creating an unmanageable number of rules to evaluate.
Using the Apriori Algorithm, we will reduce the total count of rules so that we can actually manage them.
The Apriori principle states that if an itemset has low support, every larger itemset that contains it will have low support as well and thus isn’t worth evaluating.
rules = apriori(groceries, min_support=0.005, max_len=2, use_colnames=True)
I passed a max length of two as I wanted to see rules using two items since they’re easiest to interpret and check with common sense! You can increase this to find more rules that might be more interesting.
Next, we need to create a rules DataFrame with all our basic metrics while also filtering for significance.
Using the association_rules function we imported earlier, we can pass in the frequent itemsets returned by the Apriori method.
To identify our best rules, we will want to filter on these metrics.
All of these metrics imply a strong association: if someone buys the first product (the antecedent), they are likely to buy the next product as well (the consequent). The easiest metric to filter by is lift, so we’ll create a DataFrame of associations where each rule has a lift higher than 1.0.
rules = association_rules(rules, metric="lift", min_threshold=1.0)
If we print out our rules DataFrame, we’ll see the following data.
print(rules)
The Zhang metric ranges from -1 to 1, capturing everything from perfect dissociation (-1) to perfect association (+1). This is a useful metric if there are specific items that you should never put next to each other, even if they have occasionally been purchased together.
This allows us to create distance between two items that are dissociated and to optimize for the most lift across the entire store, not just one product.
def zhangs(antecedent, consequent):
    rule_confidence = confidence(antecedent, consequent)
    not_antecedent = ~antecedent
    not_antecedent_confidence = confidence(not_antecedent, consequent)
    max_confidence = max(rule_confidence, not_antecedent_confidence)
    zhang_metric = (rule_confidence - not_antecedent_confidence) / max_confidence
    return zhang_metric
Sadly, the Zhang metric is not included in our automated rules DataFrame, so we’ll need to calculate it ourselves. The method I use is only built for one-to-one rules, as I’m mostly interested in what purchase should be recommended after a specific product.
# antecedents and consequents are frozensets, so convert to a list to grab the single item
rules['zhang'] = rules.apply(
    lambda row: zhangs(groceries[list(row['antecedents'])[0]],
                       groceries[list(row['consequents'])[0]]),
    axis=1)
Now that we have plenty of data points that have encoded the relationship between all of our products, we can start interpreting the data.
Since we have made six values, we need to decide which ones are most important to our recommendation system.
While support is valuable for creating all of the other metrics, it does not help us understand whether two items have a higher-than-average association.
Confidence can be helpful, but only when compared against the support of the consequent. If the support for cereal is 40% and the confidence of the rule “if the customer orders milk, they’ll order cereal” is also 40%, then buying milk makes no significant difference to the chance that the customer buys cereal. Therefore, confidence isn’t our best metric either.
After working with the data of this grocery set, I believe that leverage, conviction, and Zhang’s metric are the three best metrics for building a recommendation system.
Lift is an effective metric; however, it is not normalized. This causes frequently purchased products like milk to have massive lift values. Leverage, lift’s sibling metric, conveys the same information as lift without the massive numbers that add noise to our metrics.
By ranking our rules by lift, we get the following table. Notice how the top rule, that someone who buys mayonnaise will buy mustard, has a lift of 12.965.
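Both this table and the leverage table below come from simply sorting the rules DataFrame; a minimal sketch, using the column names mlxtend produces:

# Sort the rules DataFrame to surface the strongest associations by each metric
top_by_lift = rules.sort_values('lift', ascending=False)
top_by_leverage = rules.sort_values('leverage', ascending=False)

print(top_by_lift[['antecedents', 'consequents', 'lift']].head(10))
print(top_by_leverage[['antecedents', 'consequents', 'leverage']].head(10))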
If we sort by leverage, we will see the following table:
By sorting by leverage, we see that the top rule is that if someone buys other vegetables, they are likely to buy root vegetables. What seems odd in this case is that whole milk implies vegetables. I’m not sure about you, but I’ve never bought vegetables specifically because I bought milk. This is where conviction may help us filter out the noise.
Conviction is helpful in isolating rules that involve highly purchased products like milk. Milk can be purchased with almost any item, but that doesn’t necessarily imply a meaningful relationship. By using conviction, you take into account all the other combinations that milk is paired with to weigh the association.
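A small sketch of how this might look in code (the 1.2 cutoff is an illustrative assumption, not a value from the analysis): conviction is already a column in the mlxtend rules DataFrame, so we can filter and sort on it directly.

# Keep only rules whose conviction comfortably exceeds 1 (cutoff is illustrative)
high_conviction = rules[rules['conviction'] > 1.2]
high_conviction = high_conviction.sort_values('conviction', ascending=False)
print(high_conviction[['antecedents', 'consequents', 'conviction']].head(10))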
Finally, Zhang’s metric is excellent for a very specific use case. Let’s play devil’s advocate and ask, “How much money are we losing when we move a product away from one product it is associated with and toward another product it is heavily associated with?”
Thinking about this from a physical location perspective with a grocery store helps us understand the opportunity cost of this problem. If we were to move Milk next to Cereal, but away from all of the dairy products, would sales of milk go up?
The answer is that they might. The real question is, what is the opportunity cost of moving that milk? If we think of distance as a factor in whether someone sees the milk and remembers it, how many smaller products are associated with it that, taken together, may outweigh the milk-and-cereal combination?
Zhang’s metric helps us with this: it gives us a value for how associated products are as well as how dissociated they are. The key is this dissociation value. We can actually see how likely it is that milk and beans will never be purchased together, and therefore that there is no opportunity cost in switching out beans for cereal.
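As a sketch of how you might surface those dissociated pairs (not part of the original walkthrough): because we filtered for lift above 1.0 earlier, negatively associated pairs were already dropped, so we would need to regenerate the rules with a lower threshold before looking for negative Zhang values.

# Regenerate rules without excluding negative associations (illustrative only)
itemsets = apriori(groceries, min_support=0.005, max_len=2, use_colnames=True)
all_rules = association_rules(itemsets, metric="lift", min_threshold=0.0)

# Recompute Zhang's metric; negative values flag dissociated product pairs
all_rules['zhang'] = all_rules.apply(
    lambda row: zhangs(groceries[list(row['antecedents'])[0]],
                       groceries[list(row['consequents'])[0]]),
    axis=1)
print(all_rules.sort_values('zhang')[['antecedents', 'consequents', 'zhang']].head(10))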
At the end of the day, a great model takes all of these metrics into account to produce the most lift in revenue.
After looking at all of these grocery product metrics, I would trust the following metrics and insights: leverage, conviction, and Zhang’s.
The insights that stand out from these results are that if someone buys mayonnaise or mustard, they will most likely buy the other, and if someone buys hamburger meat or instant food products, they may want to buy the other. Softener and detergent are another obvious combination. These obvious connections confirm that the metrics are accurate. I would look deeper to see if you can find any that are not obvious and test them or validate them with experts.
You can see all the data, make a copy, and sort it on your own on this recommendations spreadsheet.
As always, this is just the start to this problem. There are still many questions I would like to answer:
These are all excellent questions that could make a big difference, especially in an e-commerce store. Start playing around with these concepts and let me know how they help you position your products!