A cohort analysis can be a powerful tool for software companies looking to improve their user retention. It is important to identify if a certain tactic or feature may have reduced or increased retention. In this article, we’re going to talk about how to run a cohort analysis in Python.
I will be using an e-commerce dataset as it is the closest example to a software companies data with the key difference that we will analyzing repeat purchases in a cohort analysis versus retention in subscription.
By running this script on your own data, you will be able to identify key reasons for improved or reduced retention or repeat purchases.
We will use an e-commerce dataset as an example since it’s the closest free dataset that can be found to resemble a transactional software dataset. You can download the dataset from Kaggle to follow along here.
We will need to ensure that this dataset is suitable for what we need. There needs to be enough transactions per user that we can let them resemble a software company.
First, let’s read in the data using “read_csv”. We will pass in the python engine in order to avoid a UTF-8 error.
import pandas as pd data = pd.read_csv('data.csv', engine='python')
It’s always helpful to take a full look at your data. I recommend always making your console columns full-width so you can see all the data. Then you can print out the head() function of the data frame to see the first 5 columns of the dataset. It is also helpful to call the “dtypes” attribute in order to see the types of data in your CSV.
pd.options.display.width = 0
In order to identify if this dataset will be useful, we want to visualize if it has enough transactions. An easy way to do this is by visualizing the transaction count per user with a boxplot using the Seaborn and MatPlotLib package.
import seaborn as sns import matplotlib.pyplot as plt users = data['CustomerID'].value_counts().reset_index() users.columns = ['CustomerID', 'Purchase Count'] sns.boxplot(users['Purchase Count'], sym='') plt.show()
In the code above, we are calling the value counts function on our CustomerID column. This will return a 1-column DataFrame that includes the amount of times that a customer id appears. That customer id will be the index on the DataFrame. You can think of the index as the row identifier. In order to turn this DataFrame into a two column DataFrame, we will reset the index and set the columns to “CustomerID” and “Purchase Count”.
Next, we’ll use the Seaborn library to create a box-plot of every user’s purchase count. We will pass in an empty string for the sym parameter which will remove the outliers above or below the middle 95%.
In the resulting box-plot below, we can see that the median purchase count is about 45, and 75% of the users span from 15 to 100. This is useful as it means we can correctly pretend that this e-commerce dataset is similar to a software dataset.
In order to create the correct cohorts, we will need to turn the timestamps provided into a form that we can understand and sort.
import dateutil from datetime import datetime as dt from pytz import utc data['datetime'] = data['InvoiceDate'].apply(lambda x: dateutil.parser.parse(x).timestamp()) data['month'] = data['datetime'].apply(lambda x: dt.fromtimestamp(x, utc).month) data['day'] = data['datetime'].apply(lambda x: dt.fromtimestamp(x, utc).day) data['year'] = data['datetime'].apply(lambda x: dt.fromtimestamp(x, utc).year) data['hour'] = data['datetime'].apply(lambda x: dt.fromtimestamp(x, utc).hour) data['minute'] = data['datetime'].apply(lambda x: dt.fromtimestamp(x, utc).minute) print(data)
The easiest way to create cohorts is to create monthly cohorts, though weekly cohorts are also possible. In order to sort them in order, we should multiply the year by 100 and add the month to that value. This will create an order that can be sorted correctly.
After we create a column called ‘cohort’, we will group all the orders by the CustomerID and find the minimum cohort to find the cohort they belong in as their first purchase.
We’ll rename the columns to CustomerID and first cohort and merge our new DataFrame with our previous data.
Now, every single order should have a cohort period associated with it that it was purchased within as well as the first purchase of their associated customer.
data['cohort'] = data.apply(lambda row: (row['year'] * 100) + (row['month']), axis=1) cohorts = data.groupby('CustomerID')['cohort'].min().reset_index() print(cohorts) cohorts.columns = ['CustomerID', 'first_cohort'] data = data.merge(cohorts, on='CustomerID', how='left')
We need to create an array of every cohort so that we can use it for the graph and to calculate the distance from the cohort in months. We will call the “value_counts” function on our cohorts column and reset the index. This will result in two columns of all our cohorts and the amount they occur.
We will then sort the entire DataFrame by the “Cohorts” column and turn the “Cohorts” column into a list. This will result in an ordered list of our cohorts. This will be very important when we calculate the cohort distance.
headers = data['cohort'].value_counts().reset_index()
headers.columns = ['Cohorts', 'Count']
headers = headers.sort_values(['Cohorts'])['Cohorts'].to_list()
In order to create a cohort analysis, we need to create a DataFrame that has an index of each user’s first month of making a purchase and the amount of times that the percent that made a purchase in the subsequent months.
There are two ways to create a cohort analysis. You can either do a retention cohort analysis or a returning cohort analysis.
A returning cohort analysis allows for a customer to not have to make a purchase in the periods between to be counted.
A retention cohort analysis needs to be involved in every single period past their first month to be involved in the graph.
By creating a new column called cohort distance, we can create a cohort analysis that looks like a top diagonal. The first row will have the most amount of data as it has had the most amount of time while the last row of data will have no data as it is the most recent purchase.
Our cohort distance will be the amount of months between the current order and the first purchase from the customer. Since we have multiple years, we cannot simply subtract the cohorts as a difference in a year may end up being in the thousands based on how we set up the identifier. Instead, we will use the index function to find where in the array our cohort is. This will return the order. The first month will return a 0 while the next will return a 1 even if they are a December and January combination. This will return a value in our case between 0 and 12.
data['cohort_distance'] = data.apply(lambda row: (headers.index(row['cohort']) - headers.index(row['first_cohort'])) if (row['first_cohort'] != 0 and row['cohort'] != 0)else np.nan, axis=1)
Now that we have computed our cohort distance, we can create a DataFrame that has its rows mark the first purchase cohort while the columns are the months since. The values inside will signify how many customers from that cohort purchases in that period. The first column will always be the largest.
cohort_pivot = pd.pivot_table(data, index='first_cohort', columns='cohort_distance', values='CustomerID', aggfunc=pd.Series.nunique)
The pivot_table method allows us to pivot our data so that we can calculate the amount of unique customers in each cohort based on that purchases distance from their first purchase in months.
Finally, we want to divide each row by the first column so that we can have a percentage of the customers that have returned from that cohort to make a purchase. We can achieve this by using the div function on our entire new DataFrame and using the first Series in the DataFrame. We need to specify the 0 axis so that it will divide by the value in the column for each row.
cohort_pivot = cohort_pivot.div(cohort_pivot,axis=0)
Our end result is a DataFrame that looks like this. Notice that it has a distinct shape as the cohorts go on. This is why it was important for us to use a cohort_distance rather than simply the cohort as it would have the opposite shape.
We can now move onto creating a heatmap out of our cohort analysis.
Finally, we will graph our Heatmap along with making some tweaks to make the graph look better using the Seaborn and PyPlot Library.
First, we need to set our dimensions. I set them at 1200×800 dimension so we can see the heatmap values. We will then use the subplots method to create a figure and axis. This allows us to set our labels and tick marks on the graph.
We will also create labels that are easier to read for humans by reversing the month and year format into a string that is noticeable to the human eye.
We will set all of these values to the axis so that we can make the graph easier to read.
We will use the seaborn library to create a heatmap using the “heatmap” function. We will pass in a palette that helps us identify changes as the cmap parameter. We will also use the mask parameter to make sure that all the null values are erased to make our graph easier to read. We’ll pass in our axises to make sure our changes make a difference. We will also enable annotations so that we can see the percentages on the graph.
Finally, we will call “plt.show()” to show the graph.
import seaborn as sns import matplotlib.pyplot as plt fig_dims = (12, 8) fig, ax = plt.subplots(figsize=fig_dims) y_labels = [str(int(header)%100) + '-' + str(int(header)/100) for header in headers] x_labels = range(0, len(y_labels)) plt.yticks(ticks=headers, labels=y_labels) plt.xticks(x_labels, x_labels) ax.set(xlabel='Months After First Purchase', ylabel='First Purchase Cohort', title="Cohort Analysis") sns.heatmap(cohort_pivot, annot=True, fmt='.0%', mask=cohort_pivot.isnull(), ax=ax, square=True, linewidths=.5, cmap=sns.cubehelix_palette(8)) plt.show()
The resulting graph is shown below.
Now that we have a cohort analysis of our returning purchases, we can discuss how best to analyze it.
Typically, cohort analyses help with showing retention or repeat purchase as a result of time or strategic changes.
In this case, we can see that there is a brief reduction in repeat purchases in the first month of March 2011 from an average of 24% to 19%.
We could look at our quality to identify if we made any changes in the products we sold or the email messaging or advertising we sent to those purchasers.
Typically, a cohort analysis allows us to see whether we are improving in getting customers to make repeat purchases or continue a subscription or it helps us to see if specific efforts in that time period made a noticeable difference.
You could go further to map the heat values of each column to the average to identify at a glance which months were outliers.
You could go even further to run a cohort analysis for individual factors like specific product purchases or demographics to see if they were more likely to have higher retention.
In order to make this cohort analysis work for a software company, you would simply need to add a column that returns True only if there are consecutive purchases such as a subscription. After filter the DataFrame only by orders that return true for this criteria, you can follow the rest of the code.
Download the script yourself to see if you can run it on your own data or do an advanced analysis.