One of the things I enjoy most in my job is finishing the day with the thrilling feeling that I have learnt something new. It’s quite funny for me to say that now, as claiming I experienced it during university lectures would be a big overstatement… 😉

It’s so beautiful to turn the theory into practice and discover the results in the business world.

One of the projects I am working on right now made it possible for me to learn a new model. It’s even more interesting for me because it’s based on a well-known business rule. **The Pareto Principle**, or **“80/20 Rule”** as it is frequently called today, is a great tool for growing your business. For instance, if you can figure out which 20% of your time produces 80% of your business’ results, you can spend more time on those activities and less time on others. Likewise, by identifying the characteristics of the **top 20% of customers (who represent 80% of total sales)**, you can find more customers like them and grow the company’s sales and profits significantly.

**RFM (recency, frequency, monetary) analysis** is a behavior-based technique used to segment customers by examining their transaction history, such as:

- how recently a customer has purchased (**recency**)
- how often they purchase (**frequency**)
- how much the customer spends (**monetary**)

Before using the model on real data, I played with it on an exercise data set, and I think it’s a great opportunity to show you my insights and introduce you to this useful technique. As always, feel free to check out the code on my GitHub repo.

The good news is that R already has a dedicated package for this kind of model. It’s relatively easy to run, so I’ve decided to… not use it (at least at the beginning!). By building the model from scratch, I wanted to better understand its mechanics. I will play the role of a person driving the project for the owner of a large online store selling unique gifts, whose customers are mainly wholesalers. Our aim is to perform a customer behavioral segmentation and draw conclusions based on it. I **do not** evaluate the effectiveness of RFM clustering. I only present the method of carrying it out for a sample data set and a selected case!

Without further ado… let’s begin.

## Behavioral Customer Segmentation using RFM

**1. Data set description** (you can download it here)

The collection contains data on transactions with one of the British companies selling small gift ideas and gifts. It works only online, without having a brick-and-mortar store (which I treat as a plus, as all transactions come from the same sales channel). Many of the company’s customers are wholesalers.

All transactions in the file were made between 01/12/2010 and 09/12/2011. The project has been made using **RStudio** and its libraries.

**2. Data preparation**

Using the Pareto Principle one more time – **80%** of a **data scientist’s** time is spent collecting, organizing, and **cleansing** the **data**, while only 20% is spent analyzing it. Before I touch RFM modelling, I need to understand the data first.

```
library(readxl) # read_excel()
library(dplyr)  # %>% pipelines used throughout

df <- read_excel("OnlineRetail.xlsx")
str(df)
```

First I load the data set, check its size, and preview the individual observations to see if everything has been loaded correctly. As you can see, our data frame contains **8 variables** and **541,909 observations** (before aggregation). The str() function allows us to check the variable types.

```
apply(df, 2, function(x) any(is.na(x)))
summary(df$CustomerID)
```

I have spotted missing values for **Description** and **CustomerID**. For the first variable it’s quite understandable, but we have over 135k transactions without a customer identifier! That’s a lot! What is the reason? Maybe those are just retail customers who haven’t received an ID? I don’t know. Anyway, we have 2 options:

- treat the “retail” customers without an ID as one client
- remove those transactions from the analysis (there would be around 400k rows left)
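As a quick aside, the share of transactions without an ID can be computed directly; here is a minimal sketch on a toy vector (the values are made up):

```r
# Toy stand-in for df$CustomerID (NA = transaction without an identifier)
ids <- c(17850, NA, 13047, NA, 12583, 17850, NA, 14688)

sum(is.na(ids))                    # number of transactions without an ID -> 3
round(mean(is.na(ids)) * 100, 2)   # their percentage share -> 37.5
```

Run against the full data frame, the same two lines produce the counts quoted below.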

Although it’s **24.93%** of all our observations, I have decided to go for the second option. It’s more reasonable from the business perspective, as the first one could have a negative impact on the further clustering: one big “client” would account for around 25% of the transactions, and the rest of the customers could be considered just noise! I am removing the missing values from the analysis:

`df <- df %>% filter(!is.na(CustomerID))`

The description of the data on the UCI website shows that all invoices with an identifier starting with the letter **“C”** have been **cancelled**. There are around 9k of them. Ultimately, we will only be interested in the invoices that haven’t been cancelled, which is why I have also removed them from our data set:

```
df %>% filter(substr(InvoiceNo, 1, 1) == "C")        # preview the cancelled invoices
df <- df %>% filter(substr(InvoiceNo, 1, 1) != "C")  # keep only non-cancelled ones
```

Last but not least, I have removed the **redundant variables**:

`df <- df %>% select(-c(Description, StockCode, Country))`

In RFM clustering, it is usually assumed that only operations from **the last 12 months** are taken into account. The reason for this is simple: older operations can say little about the current situation of the seller, buyer, and the product itself. Of course, it all depends on the industry. Anyway, in our case, the analysis period should be selected in accordance with the requirements and characteristics of the case under consideration:

```
# Reference date for recency; for this 2010-2011 data set you could also
# use max(df$InvoiceDate) so that recency values stay interpretable
today <- Sys.time()
df <- df %>% filter(InvoiceDate > "2010-12-09 12:50:00")
```

There are **384,568** observations left.

**3. RFM modelling**

**a) without rfm package**

As mentioned before, first I wanted to understand the model without using the ready-made package.

So how is the RFM score computed for each customer? For a better introduction to the additional variables which I am about to create, please get familiar with the fragment below from one of the articles I have found on **R-bloggers**:

> A **recency** score is assigned to each customer based on the date of the most recent purchase. The score is generated by binning the recency values into a number of categories (default is 5). For example, if you use four categories, the customers with the most recent purchase dates receive a recency ranking of 4, and those with purchase dates in the distant past receive a recency ranking of 1.
>
> A **frequency** ranking is assigned in a similar way. Customers with high purchase frequency are assigned a higher score (4 or 5) and those with the lowest frequency are assigned a score of 1.
>
> A **monetary** score is assigned on the basis of the total revenue generated by the customer in the period under consideration for the analysis. Customers with the highest revenue/order amount are assigned a higher score, while those with the lowest revenue are assigned a score of 1.
>
> A fourth score, the **RFM score**, is generated, which is simply the three individual scores concatenated into a single value.

(link to the article)
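To make the concatenation step concrete, here is a minimal sketch with made-up bin values (the three vectors are purely illustrative):

```r
# Made-up R/F/M bins (1-4) for three customers
r <- c(4, 2, 1)
f <- c(3, 4, 1)
m <- c(4, 4, 2)

# The combined RFM score is simply the three digits glued together
rfm_score <- as.numeric(paste0(r, f, m))
rfm_score   # 434 244 112
```

So a customer with top recency, solid frequency and top monetary value ends up with 434, while a lapsed low-spender ends up with 112.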

I am adding three additional variables to our data set. I cannot do the whole thing in one step because of the grouping needed for frequency and the additional variable (**today**’s date) needed for recency.

```
# Adding MonetaryValue
df <- df %>%
  mutate(MonetaryValue = Quantity * UnitPrice)

# Adding Recency
df <- df %>%
  mutate(Recency = difftime(today, InvoiceDate, unit = "days"))

# Adding Frequency (aggregating to one row per customer)
agg_df <- df %>%
  group_by(CustomerID) %>%
  summarise(
    Recency = min(Recency),
    MonetaryValue = sum(MonetaryValue),
    Frequency = n()
  )
```

The next step is to transform all three variables. The RFM method requires ordinal variables at this stage. This can be done using **the percentiles of the distribution of a given variable**.

```
# qcut() is a quartile-binning helper (not part of base R);
# dplyr::ntile() or Hmisc::cut2() provide similar quantile bins
r <- qcut(agg_df$Recency, cuts = 4)
f <- qcut(agg_df$Frequency, cuts = 4)
m <- qcut(agg_df$MonetaryValue, cuts = 4)
transf_df <- as.data.frame(cbind(r, f, m))
table(transf_df)
```

The labels will be the values **1-4**. When labeling the values of variables, one must remember that a **higher label should mean a better situation** from the seller’s point of view. Therefore, in the case of “Recency” – the lower the value of the variable, the higher the label (we prefer customers who are active). The situation is the opposite in the case of “Frequency” – the higher the value of the variable, the higher the label (we prefer customers who buy more often).
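To illustrate the reversal for “Recency”, here is a base-R sketch on toy data (the rank-based helper below is a stand-in for qcut(), assuming the number of customers divides evenly into quartiles):

```r
# Toy recency values (days since last purchase) for eight customers
recency <- c(2, 10, 35, 90, 5, 200, 60, 15)

# Rank-based quartile bins: 1 = most recent ... 4 = least recent
raw_bins <- ceiling(rank(recency) / (length(recency) / 4))

# Flip the labels so the MOST recent customers get the HIGHEST score
r_score <- 5 - raw_bins   # recency = 2 days now gets the top score of 4
```

No flipping is needed for “Frequency” and “MonetaryValue”, where bigger raw values are already better.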

Now it’s high time to **determine the optimal number of clusters**. This is a fundamental issue in partitioning clustering, such as k-means, which requires the user to specify the number of clusters k to be generated. The same applies in our case.

We will try to define clusters, using **the Elbow method**, such that the **total within-cluster sum of square (WSS)** is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.

The Elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn’t substantially improve the total WSS.

The optimal number of clusters can be determined as follows:

1. Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

You can read more about similar methods for optimal cluster number here.

We can use the Elbow method in R as follows:

```
wssplot <- function(data, nc = 20, seed = 1234) {
  # WSS for k = 1 is simply the total variance of the data
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss,
    type = "b", xlab = "Number of Clusters",
    ylab = "Within groups sum of squares"
  )
}
wssplot(transf_df)
```

That’s how our plot looks. The Elbow method suggests that the optimal number of clusters in our case is **2**. To be sure, I will also check it with the dedicated NbClust package. It provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

```
library(NbClust)
set.seed(1234)
nc <- NbClust(transf_df, min.nc = 2, max.nc = 20, method = "kmeans")
table(nc$Best.n[1, ])
```

According to the majority rule, the best number of clusters is 2. Another good candidate is 4 clusters.

Let’s test both solutions. First, for 2 clusters:

```
set.seed(1234)
km.res <- kmeans(transf_df, 2, nstart = 25)
km.res
```

According to the table we can divide customers into:

- 1st cluster: potential loyalist recent customers, spent good amount, bought more than once
- 2nd cluster: at risk customers

It’s also possible to compute the mean of each variable by cluster using the original data:

`aggregate(agg_df, by = list(cluster = km.res$cluster), mean) %>% select(-CustomerID)`

And now for **4** clusters:

```
set.seed(1234)
km.res <- kmeans(transf_df, 4, nstart = 25)
km.res
aggregate(agg_df, by = list(cluster = km.res$cluster), mean) %>% select(-CustomerID)
```

K-means clustering with 4 clusters of sizes 1361, 790, 774, 1345.

According to the table we can divide customers into:

- 1st cluster: loyal customers, spend good money, responsive to promotions
- 2nd cluster: at risk, spent big money, purchased often but long time ago
- 3rd cluster: potential loyalist recent customers, spent good amount, bought more than once
- 4th cluster: hibernating low spenders, low frequency, purchased long time ago

How many clusters should I choose? 2 or 4? It’s up to you; it depends on the kind of actions you want to take for each customer group. **Anyway, regardless of how many clusters you take, it’s worth considering their number already at the stage of building the model.** In my case I would go for 4 clusters, as it gives me a better understanding of my customer types.
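Whichever k you pick, the cluster labels can be attached back to the customer table so each segment can actually be targeted. A minimal self-contained sketch on toy scores (the data frame below is made up; in the session above you would simply do agg_df$Cluster <- km.res$cluster):

```r
set.seed(1234)

# Toy customer table with already-binned R/F/M scores (1-4)
toy <- data.frame(
  r = sample(1:4, 40, replace = TRUE),
  f = sample(1:4, 40, replace = TRUE),
  m = sample(1:4, 40, replace = TRUE)
)

km <- kmeans(toy, centers = 4, nstart = 25)

# Attach the cluster label to each customer row
toy$Cluster <- km$cluster
table(toy$Cluster)   # segment sizes
```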

**b) using rfm package**

I will not go into much detail about the package usage, as my article is already one thousand kilometers long. Please get familiar with the code and results if you are interested (or not asleep 🙂 ).

```
rfm_data <- df %>% select(InvoiceNo, InvoiceDate, CustomerID, UnitPrice, Quantity)
rfm_data <- rfm_data %>%
group_by(CustomerID) %>%
mutate(Amount = Quantity * UnitPrice) %>%
select(-c(InvoiceNo, Quantity, UnitPrice))
# Customer segmentation
rfm_result <- rfm_table_order(rfm_data, CustomerID, InvoiceDate, Amount, today)
rfm_result
segment_names <- c(
"Champions", "Loyal Customers", "Potential Loyalist",
"New Customers", "Promising", "Need Attention", "About To Sleep",
"At Risk", "Can't Lose Them", "Lost"
)
recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
segments <- rfm_segment(
rfm_result, segment_names, recency_lower, recency_upper,
frequency_lower, frequency_upper, monetary_lower, monetary_upper
)
segments %>%
count(segment) %>%
arrange(desc(n)) %>%
rename(Segment = segment, Count = n)
```

Our customers are grouped into **8 classes** right now. Ideally, we should have very few or no customers in segments such as At Risk or Need Attention. You can find our defined distribution below:

You can get more insight into the results by using various **inbuilt visualization functions**:

- the **median** recency, frequency and monetary value across segments, to ensure that the logic used for customer classification is sound and practical
- the **heat map**, which shows the average monetary value for different categories of recency and frequency scores; higher frequency and recency scores are characterized by a higher average monetary value, as indicated by the darker areas of the heatmap
- **rfm_bar_chart()** to generate the distribution of monetary scores for the different combinations of frequency and recency scores
- **rfm_histograms()** to examine the relative distribution of frequency, monetary value and recency
- **rfm_order_dist()** to visualize the distribution of customers across orders
- **scatterplots** to examine the relationships

```
rfm_plot_median_recency(segments)
rfm_plot_median_frequency(segments)
rfm_plot_median_monetary(segments)
rfm_heatmap(rfm_result)
rfm_bar_chart(rfm_result)
rfm_histograms(rfm_result)
rfm_order_dist(rfm_result)
rfm_rm_plot(rfm_result)
rfm_fm_plot(rfm_result)
rfm_rf_plot(rfm_result)
```

For example:

You will find more information about the package in the R-bloggers article I have linked before.

Summing up, thank you for getting to the end of this post. I hope you liked it. If you have any comments, questions or observations, feel free to let me know in the comments below.
