How to handle imbalanced data? Example in R.

Words cannot express how much I enjoy the first interactions with the audience of my blog! Since I launched the dedicated Instagram profile, so many of you have decided to reach out, ask for various tips or propose a topic for an article. I created this blog with the intention of writing for myself, but the possibility of serving someone else at the same time and adding value really takes this activity to the next level. Today’s topic has been requested by one of the readers. If you are reading this post, I hope you will find it useful.

What is imbalanced data?

In the simplest, most informal terms, data is imbalanced when there is a significant difference in the frequency of its outcomes. The easiest example is a binary classification case like fraud detection, where most orders are non-fraudulent and only a very small subset is fraudulent – but this minority is extremely important. Another example is churn prediction, where you predict a customer’s unsubscription from a service; as we know, unsubscribing is a relatively rare behavior.

The class or classes highly represented are called the major or majority classes, whereas the class with few examples (and there is typically just one) is called the minor or minority class.

  • Majority Class: The class (or classes) in an imbalanced classification predictive modeling problem that has many examples.
  • Minority Class: The class in an imbalanced classification predictive modeling problem that has few examples.

What are the consequences of modeling imbalanced datasets?

Imbalanced data poses a challenge for most of the machine learning algorithms used for classification, as those have been designed around the assumption of an equal number of examples for each class. This means that a naive application of a model may focus on learning the characteristics of the majority class, neglecting the examples from the minority class that is, in fact, of more interest and whose predictions are more valuable.

Standard classification evaluation measures can give misleading results on imbalanced data. We therefore either need to balance the data or use performance measures, such as precision and recall, that remain informative for imbalanced datasets.

Precision

Precision attempts to answer the following question:

What proportion of positive identifications was actually correct?

It’s defined as follows:

 Precision = \frac{TP}{TP + FP}

where TP stands for True Positives and FP for False Positives.

Recall

Recall attempts to answer the following question:

What proportion of actual positives was identified correctly?

Mathematically, recall is defined as follows:

 Recall = \frac{TP}{TP + FN}

where TP stands for True Positives and FN for False Negatives.

To fully evaluate the effectiveness of a model, you must examine both precision and recall.
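As a quick worked example, here is how both measures could be computed in R from raw confusion-matrix counts (the numbers are made up purely for illustration):

# Hypothetical confusion-matrix counts
TP <- 90   # true positives
FP <- 10   # false positives
FN <- 30   # false negatives

precision <- TP / (TP + FP)   # 0.90 - of everything we flagged, how much was real
recall    <- TP / (TP + FN)   # 0.75 - of everything real, how much we caught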

Accuracy paradox – why accuracy can be misleading

Below are the main reasons why ML algorithms perform poorly on imbalanced data sets:

  1. ML algorithms struggle with accuracy because of the unequal distribution of the target (Y) variable.
  2. Classifier performance gets biased towards the majority class.
  3. The algorithms aim to minimize the overall error, in which the minority class plays an insignificant part.
  4. ML algorithms assume that the data set has a balanced class distribution.
  5. They also assume that errors obtained from different classes have the same cost.

What are the methods to deal with imbalanced data sets?

There are two ways to approach the imbalanced data sets problem:

  • acting on data – that is, influencing the initial dataset
  • operating on the cost function – modifying class weights and how errors on each class affect the model.

I will elaborate on some of the methods below:

Undersampling

In simplest words, this is a method of removing data from the majority class. We remove random records from the majority class until both classes have the same cardinality. Imagine you are analyzing a dataset for cancer detection: most patients’ results will be fine. In the undersampling scenario, we simply take less data from the majority class to reduce the extent of imbalance in the data set.

Is it a good method? Unfortunately, removing observations may cause the training data to lose important information typical of the majority class. Thus, the approach is good only when we have enough cases of the minority class. How much is “enough”? It depends, as usual. 🙂

I fully believe that learning is easier with pictures so let’s try to understand the method by an example. Imagine that there is an imbalanced dataset like the one below with red points representing the majority outcome and blue color for minority class.

Before undersampling

If we have a really big data set and we don’t necessarily need the entire data to train our model, we can randomly remove a portion of the majority class so that we end up with a much more balanced data set.

After undersampling
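In code, random undersampling can be as simple as the base-R sketch below (assuming a data frame df with a binary Class column, where 1 marks the minority class):

# Random undersampling: keep all minority rows, sample the majority down to match
set.seed(42)
minority <- df[df$Class == 1, ]
majority <- df[df$Class == 0, ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
df_under <- rbind(minority, majority_down)
table(df_under$Class)  # both classes now have the same number of rows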

Oversampling

I suppose I won’t surprise you by saying that oversampling is just the opposite of the previous method. This method works on the minority class: we artificially add data to the less numerous class to balance the number of occurrences in each class. It is also known as upsampling.

An advantage of this method is that it leads to no information loss. However, it’s not a perfect method either, because adding multiple copies of the same observations can lead to overfitting: the training accuracy on such a data set will be high, but the accuracy on unseen data will be worse.

Before oversampling

So, coming back to our example, standard oversampling keeps the entire majority class but repeatedly resamples the minority set: we randomly choose one of the minority data points and add a copy of it to the training data. We keep adding points until the ratio between the classes makes the data more balanced.

After oversampling (repetition of minority class)
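The matching sketch for random oversampling, under the same assumption of a data frame df with a binary Class column:

# Random oversampling: keep all majority rows, resample the minority with replacement
set.seed(42)
minority <- df[df$Class == 1, ]
majority <- df[df$Class == 0, ]
minority_up <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
df_over <- rbind(majority, minority_up)
table(df_over$Class)  # classes are balanced, but minority rows repeat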

Synthetic Minority Oversampling Technique (SMOTE)

Synthetic data generation is a type of oversampling technique and a remedy for the overfitting risk mentioned above. It overcomes the problem by generating artificial data instead of replicating observations from the minority class.

To be honest, with plain oversampling we could also try to generate similar data from our minority sample by hand. Fortunately, we don’t have to, because the SMOTE algorithm does exactly this! SMOTE generates new samples between existing minority data points, based on their nearest neighbors. The method was first described in 2002 in the paper by Nitesh Chawla et al. entitled “SMOTE: Synthetic Minority Over-sampling Technique”.

This technique creates new instances of minority-class data by taking existing minority examples and interpolating small variations between them, rather than simply copying them. This makes SMOTE great for amplifying signals that already exist in the minority class, but it will not create genuinely new signals for that class.

Learning by example

Let’s make it more visual. Below we can see an example of highly imbalanced data: red crosses stand for non-fraudulent orders and blue triangles for fraudulent transactions. We want to predict whether orders are fraudulent or not, and we’re going to use SMOTE to help train the model to do that.

Before SMOTE

We need to generate multiple new synthetic fraudulent cases so that our model can effectively predict the outcome. We start by choosing two hyperparameters: R (the class balance we want between the minority and majority classes after applying SMOTE) and k (the number of nearest neighbors considered for each point).

For example, if we have 99 non-fraudulent cases and one fraudulent case and we aim to reach R = 0.5 – that is, frauds making up half of the final data – we need to create 98 new synthetic points so that both classes end up with 99 cases.

We’re going to iterate over the loop below N times:

Hyperparameters: R, k

N = number of synthetic points needed to reach ratio R

for i in range(N):

Step 1: Choose a random minority point x

Step 2: Get the k nearest neighbors of x

Step 3: Choose a random nearest neighbor y of x

Step 4: For each dimension, compute x̂ = x + α(y − x) with a random α in [0, 1]

Step 5: Add x̂ to the dataset

Step 1: Choose a random minority point x
Step 2: Get the k nearest neighbors of x
Step 3: Choose a random nearest neighbor y of x

We need to select α from [0, 1]; choosing α = 0.25 gives a new point x̂ that is the weighted combination 0.75·x + 0.25·y, i.e. a quarter of the way from x towards y.

Steps 4 & 5: Selecting alpha and adding new point to the dataset
Iterate!
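To make the pseudocode concrete, below is a rough base-R sketch of a single SMOTE iteration. It assumes minority_x is a numeric matrix with one minority observation per row (and at least k + 1 rows); it is only an illustration, not the implementation used later in the post:

smote_one_point <- function(minority_x, k = 5) {
  # Step 1: pick a random minority point x
  i <- sample(nrow(minority_x), 1)
  x <- minority_x[i, ]
  # Step 2: find its k nearest minority neighbors (euclidean distance)
  d <- sqrt(rowSums(sweep(minority_x, 2, x)^2))
  nn <- order(d)[2:(k + 1)]        # skip the point itself
  # Step 3: choose one neighbor y at random
  y <- minority_x[sample(nn, 1), ]
  # Steps 4 & 5: interpolate with a random alpha in [0, 1]
  alpha <- runif(1)
  x + alpha * (y - x)              # the new synthetic point
}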

Custom Loss Function

The last approach we’ll get familiar with today is balancing the data through a custom loss function: we change the weights in our loss function to account for the imbalance in the data.

So if we try to understand this through a normal log loss function:

 -(x * log(p) + (1 - x) * log(1 - p))

where p is our prediction and x is the TRUE label.

The easiest way to build intuition for the formula is, again, to walk through the different combinations of the true label and the predicted value.

Log loss example

TRUE   PRED   LOG LOSS RESULT
 0      0     0
 0      1     Massive loss
 1      0     Massive loss
 1      1     0

But how do we customize this if we have imbalanced data? It’s actually very easy. The custom loss function is exactly the same function with a weight A multiplied into the first term: -(A * x * log(p) + (1 - x) * log(1 - p)). A is simply the ratio by which we want to rebalance the two outcomes. If we care far more about correctly catching frauds (x = 1), we make A greater than 1, so a genuinely fraudulent case that the model gets wrong contributes a much larger loss.

The loss term for an inaccurately predicted fraudulent order is then much larger than the loss for a non-fraudulent order that we misclassify. This compensates for the fact that we have far fewer fraudulent orders than non-fraudulent ones.
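As a hedged sketch (the weight A = 10 below is an arbitrary example value, not something prescribed by the post), such a weighted log loss could look like this in R:

weighted_log_loss <- function(x, p, A = 10, eps = 1e-15) {
  # x: true label (0/1), p: predicted probability, A: weight on the positive (fraud) class
  p <- pmin(pmax(p, eps), 1 - eps)              # avoid log(0)
  -mean(A * x * log(p) + (1 - x) * log(1 - p))
}

weighted_log_loss(x = 1, p = 0.1)   # missed fraud: heavily penalized
weighted_log_loss(x = 0, p = 0.1)   # confident non-fraud: small loss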

Example in R with fraud detection data

I hope I haven’t overwhelmed you with all the theory. Let’s move on to our example, which uses the Credit Card Fraud Detection dataset I found on Kaggle.

The dataset contains European cardholder transactions that occurred over two days.

# Loading data
data <- data.table::fread("data/creditcard.csv")
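A quick note on dependencies: the snippets that follow assume (roughly) the packages below are installed and loaded – this list is my reconstruction rather than part of the original script:

# Packages assumed throughout the example
library(data.table)    # fread
library(dplyr)         # select, sample_n, rename
library(reshape2)      # melt
library(scales)        # rescale
library(smotefamily)   # SMOTE
library(ROSE)          # ovun.sample
library(caret)         # createDataPartition, confusionMatrix
library(randomForest)  # randomForest
library(ggplot2)       # plots
library(gridExtra)     # grid.arrange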

The feature descriptions are as follows:

  • Time – the seconds elapsed between each transaction and the first transaction in the dataset
  • V1, V2, …, V28 – principal components obtained through dimensionality reduction (PCA)
  • Amount – the transaction amount
  • Class – the response variable, indicating whether a transaction was fraudulent or not (1 for a fraudulent transaction, 0 otherwise)
Glimpse at data

Using the table() function we can easily check the class counts of the dependent variable in our dataset:

table(data$Class)
Imbalanced data

Woah – this dataset is highly imbalanced: of 284,807 transactions, just 492 (~0.173%) are fraudulent! This poses a challenge with building an effective classifier, as >99% accuracy could be achieved by predicting all transactions as legitimate.
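If you prefer shares to raw counts, prop.table() expresses the same imbalance directly as proportions:

# Class shares rather than raw counts
prop.table(table(data$Class))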

Let’s look a little deeper at our predictors.

predictors <- select(data, -Class)

cbind(melt(apply(predictors, 2, min), value.name = "min"), 
      melt(apply(predictors, 2, max), value.name = "max"))
Variables’ ranges

It seems that before we move further, we should scale our data. Let me perform this operation using the min-max method. Failing to do this would mean that variables with a larger range (e.g. Amount) dominate the KNN step used by SMOTE, as their effect when calculating which points are closest in euclidean distance would be disproportionately high – the nearest neighbors in euclidean distance would essentially be the nearest neighbors in Amount.

predictors_rescaled <- as.data.frame(apply(predictors, 2, rescale))

cbind(melt(apply(predictors_rescaled, 2, min), value.name = "min_after_rescaling"), 
      melt(apply(predictors_rescaled, 2, max), value.name = "max_after_rescaling"))

Don’t forget to bind scaled data to the main dataset:

data <- cbind(Class = data$Class, predictors_rescaled)

Balancing data set using SMOTE and RUS

For the purpose of today’s post, let’s test SMOTE in conjunction with randomized under-sampling (RUS). It cuts down computation time significantly, and can lead to better test-set performance in ROC space than training on the original imbalanced data.

SMOTE uses KNN to generate synthetic examples, and the default number of nearest neighbours is K = 5. I’ll stick to the default value.

The steps SMOTE takes to generate synthetic minority (fraud) samples are as follows:

  1. Choose a minority case: X
  2. Find K nearest neighbors (in euclidean distance, as default) of X: X1, …, X5
  3. Randomly choose one of these K minority neighbors: X4
  4. Generate a random number between 0 and 1: i
  5. Create a synthetic minority observation somewhere ‘between’ the 2 points: X + i(X4 – X)
  6. Repeat dup_size times for each fraud case in the dataset. e.g. for dup_size = 1, SMOTE adds 1 synthetic fraud case for each real fraud case.

Moving back to R, let’s start with the standard way of data sampling:

set.seed(23)

sample <- sample_n(data, 10000)

Now I will generate 4 synthetic fraud samples for every real (minority) fraud case:

sample_smote <- SMOTE(
  X = sample[, -1],
  target = sample$Class,
  dup_size = 4
)
sample_smote_data <- sample_smote$data
sample_smote_data$class <- factor(sample_smote_data$class)
levels(sample_smote_data$class)
table(sample_smote_data$class)

As you can see, the proportion looks a little bit better:

For majority undersampling, the methodology is pretty intuitive – majority (non-fraud) cases are randomly removed from the dataset until the desired overall volume is reached. The code below sets the desired overall volume N to 11x the volume of the minority (fraud) cases, so the dataset becomes 10:1, legitimate to fraudulent.

sample_smote_under <- ovun.sample(class ~ .,
  data = sample_smote_data,
  method = "under",
  N = nrow(sample_smote_data[sample_smote_data$class == 1, ]) * 11
)
sample_smote_under_data <- sample_smote_under$data
levels(sample_smote_under_data$class)
sample_smote_under_data$class <- relevel(sample_smote_under_data$class, ref = 1)

As you can see below, now the proportion is totally different!

Let’s use ggplot2 now to put together an illustration of what SMOTE and randomized under-sampling are doing across 2 dimensions (V1 & V2) below, for a random sample of 10,000 rows of the data:

# Changes visualization
p1 <- ggplot(sample, aes(x = V1, y = V2, col = Class)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~Class, labeller = labeller(Class = c("1" = "Fraud", "0" = "Not Fraud"))) +
  labs(
    title = "Before SMOTE",
    subtitle = "10,000 Random Sample",
    col = "Class"
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.position = "none")

p2 <- ggplot(sample_smote_data, aes(x = V1, y = V2, col = class)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~class, labeller = labeller(class = c("1" = "Fraud", "0" = "Not Fraud"))) +
  labs(
    title = "After SMOTE",
    subtitle = "4 Synthetic Majority Samples (per original minority sample)",
    col = "Class"
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.position = "none")

p3 <- ggplot(sample_smote_under_data, aes(x = V1, y = V2, col = class)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~class, labeller = labeller(class = c("1" = "Fraud", "0" = "Not Fraud"))) +
  labs(
    title = "After SMOTE & Random Majority Undersampling",
    subtitle = "Reduced majority:minority ratio to 10:1",
    col = "Class"
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.position = "none")

grid.arrange(p1, p2, p3, nrow = 3)
Various sampling methods impact comparison

Sampling techniques – Performance

Train/Test Split

I’m going to test model performance on the unaltered dataset against 6 different combinations of SMOTE & RUS.

Watch out! We should only be applying the sampling techniques to the training dataset, and model performance should be evaluated on an un-altered test set.

1. The unaltered, highly class-imbalanced data set

train_index <- createDataPartition(sample$Class, p = 0.75, list = FALSE)

train <- sample[train_index, ] # training data (75% of data)
test <- sample[-train_index, ] # testing data (25% of data)

2. A balanced data set with less up-sampling

The dup_size parameter is the number (or vector) of synthetic minority instances to generate for each original minority instance. For the less up-sampling approach, I will go with dup_size = 6.

smote_v1 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 6) 
smote_train_v1 <- smote_v1$data %>% rename(Class = class)

# under-sample until majority sample size matches
under_v1 <- ovun.sample(Class ~ .,
  data = smote_train_v1,
  method = "under",
  N = 2 * sum(smote_train_v1$Class == 1)
)

train_v1 <- under_v1$data

3. A balanced dataset with more up-sampling

smote_v2 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 29)
smote_train_v2 <- smote_v2$data %>% rename(Class = class)

under_v2 <- ovun.sample(Class ~ .,
  data = smote_train_v2,
  method = "under",
  N = 2 * sum(smote_train_v2$Class == 1)
)

train_v2 <- under_v2$data

4. A fraud-majority dataset with less up-sampling

smote_v3 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 6)
smote_train_v3 <- smote_v3$data %>% rename(Class = class)

under_v3 <- ovun.sample(Class ~ .,
  data = smote_train_v3,
  method = "under",
  N = round(sum(smote_train_v3$Class == 1) * (4 / 3))
)

train_v3 <- under_v3$data

5. A fraud-majority dataset with more up-sampling

smote_v4 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 29)
smote_train_v4 <- smote_v4$data %>% rename(Class = class)

under_v4 <- ovun.sample(Class ~ .,
  data = smote_train_v4,
  method = "under",
  N = round(sum(smote_train_v4$Class == 1) * (4 / 3))
)

train_v4 <- under_v4$data

6. A fraud-minority dataset with less up-sampling

smote_v5 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 6)
smote_train_v5 <- smote_v5$data %>% rename(Class = class)

under_v5 <- ovun.sample(Class ~ .,
  data = smote_train_v5,
  method = "under",
  N = (sum(smote_train_v5$Class == 1) * 4)
)

train_v5 <- under_v5$data

7. A fraud-minority dataset with more up-sampling

smote_v6 <- SMOTE(X = train[, -1], target = train$Class, dup_size = 29)
smote_train_v6 <- smote_v6$data %>% rename(Class = class)

under_v6 <- ovun.sample(Class ~ .,
  data = smote_train_v6,
  method = "under",
  N = (sum(smote_train_v6$Class == 1) * 4)
)

train_v6 <- under_v6$data

The table below summarizes all train datasets:

train_datasets <- list(
  train = train,
  train_v1 = train_v1,
  train_v2 = train_v2,
  train_v3 = train_v3,
  train_v4 = train_v4,
  train_v5 = train_v5,
  train_v6 = train_v6
)

dataset <- 0
obs <- 0
frauds <- 0
frauds_perc <- 0


for (i in 1:7) {
  dataset[i] <- names(train_datasets)[i]
  obs[i] <- nrow(train_datasets[[i]])
  frauds[i] <- sum(train_datasets[[i]]$Class == 1)
  frauds_perc[i] <- frauds[i] / obs[i]
}

(train_datasets_summary <- data.frame(
  name = dataset,
  num_obs = obs,
  frauds = frauds,
  frauds_perc = frauds_perc,
  weighting = c("original (very imbalanced)", "balanced", "balanced", "mostly fraud", "mostly fraud", "mostly non-fraud", "mostly non-fraud"),
  smote_amt = c("none", "some", "lots", "some", "lots", "some", "lots")
))
Sampling techniques summary

Data Modeling

There are several machine learning algorithms we could use for our fraud detection example. I have decided to use Random Forest, as it’s relatively easy to apply and explain. I would like to emphasize at this point that the purpose of this post is not finding the best algorithm or tuning hyperparameters, so I won’t focus on that in our example.

Regarding Random Forest parameters, we will go with the default value for ntree (500), and mtry (the number of variables tried at each split) should be roughly the square root of the number of features, which in our case works out to 5.
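For reference, the back-of-the-envelope calculation (assuming train still holds the 30 predictor columns plus Class):

# Rule of thumb for classification forests: mtry ~ sqrt(number of predictors)
floor(sqrt(ncol(train) - 1))   # sqrt(30) is about 5.48 -> 5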

Full Dataset – train

table(train$Class)

# make sure the response is a factor so randomForest builds a classification forest
train$Class <- as.factor(train$Class)

rf.model <- randomForest(Class ~ ., data = train, 
                          ntree = 500,
                          mtry = 5)
print(rf.model)

rf.predict <- predict(rf.model, test)

test$Class <- as.factor(test$Class)

confusionMatrix(data = rf.predict, reference = test$Class)
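Since the whole point of this post is that accuracy alone can mislead, it is worth also pulling precision and recall for the fraud class out of the caret output – a small hedged sketch, assuming the factor levels are "0" and "1":

# Precision and recall for the fraud class ("1" treated as the positive level)
cm <- confusionMatrix(data = rf.predict, reference = test$Class, positive = "1")
cm$byClass[c("Precision", "Recall")]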

Small, Balanced (50:50) – train_v1

table(train_v1$Class)

# ensure the response is a factor for classification
train_v1$Class <- as.factor(train_v1$Class)

rf.model_v1 <- randomForest(Class ~ ., data = train_v1,
                            ntree = 500,
                            mtry = 5)

rf.predict_v1 <- predict(rf.model_v1, test)

test$Class <- as.factor(test$Class)

confusionMatrix(data = rf.predict_v1, reference = test$Class)

Larger, Balanced (50:50) – train_v2

table(train_v2$Class)

train_v2$Class <- as.factor(train_v2$Class)

rf.model_v2 <- randomForest(Class ~ ., data = train_v2, 
                            ntree = 500,
                            mtry = 5)

rf.predict_v2 <- predict(rf.model_v2, test)

confusionMatrix(data = rf.predict_v2, reference = test$Class)

Small, Fraud-Majority (75:25) – train_v3

table(train_v3$Class)

train_v3$Class <- as.factor(train_v3$Class)

rf.model_v3 <- randomForest(Class ~ ., data = train_v3, 
                            ntree = 500,
                            mtry = 5)

rf.predict_v3 <- predict(rf.model_v3, test)

confusionMatrix(data = rf.predict_v3, reference = test$Class)

Larger, Fraud-Majority (75:25) – train_v4


table(train_v4$Class)

train_v4$Class <- as.factor(train_v4$Class)

rf.model_v4 <- randomForest(Class ~ ., data = train_v4, 
                            ntree = 500,
                            mtry = 5)

rf.predict_v4 <- predict(rf.model_v4, test)

confusionMatrix(data = rf.predict_v4, reference = test$Class)

Smaller, Fraud-Minority (25:75) – train_v5

table(train_v5$Class)

train_v5$Class <- as.factor(train_v5$Class)

rf.model_v5 <- randomForest(Class ~ ., data = train_v5, 
                            ntree = 500,
                            mtry = 5)

rf.predict_v5 <- predict(rf.model_v5, test)

confusionMatrix(data = rf.predict_v5, reference = test$Class)

Larger, Fraud-Minority (25:75) – train_v6

table(train_v6$Class)

train_v6$Class <- as.factor(train_v6$Class)

rf.model_v6 <- randomForest(Class ~ ., data = train_v6, 
                            ntree = 500,
                            mtry = 5)

rf.predict_v6 <- predict(rf.model_v6, test)

confusionMatrix(data = rf.predict_v6, reference = test$Class)

Conclusions

name       num_obs   frauds   frauds_perc   weighting          smote_amt   accuracy
train      213,606      375    0.0018       very imbalanced    none        0.9996
train_v1     5,250     2625    0.50         balanced           some        0.9943
train_v2    22,500    11250    0.50         balanced           lots        0.9983
train_v3     3,500     2625    0.75         mostly fraud       some        0.968
train_v4    15,000    11250    0.75         mostly fraud       lots        0.9933
train_v5    10,500     2625    0.25         mostly non-fraud   some        0.9982
train_v6    45,000    11250    0.25         mostly non-fraud   lots        0.999

To sum up, I have aggregated the accuracy for all the training sets in the table above. What can I say?

  • The balanced datasets and those that maintained the fraud minority seemed to perform slightly better than those where the class imbalance was reversed (to a fraud majority).
  • Regarding SMOTE, a larger amount of synthetic fraud cases seemed preferable to a smaller one.
  • The optimal balance of data and the amount to upsample/downsample will vary for different datasets.
  • There are no big differences in this example – accuracy varies only by fractions of a percent – but with a bigger dataset the effect would probably be more visible.

Voilà! Even in this case we can see that these methods may have contributed to improving the metrics.

Summary

I really hope you liked this post. Although each problem is different, please treat this guide as a solid foundation. Don’t be afraid to experiment and test various approaches. Remember – you’ll never learn just by watching somebody else’s code! You need to try your skills on your own and make all the possible mistakes.

Please let me know what you have enjoyed about this article, or what you have found difficult to understand and would like to have explained in more detail. By the way, I think this article will be the beginning of a brand new “How to deal with…?” series on my blog. There are so many problems we need to overcome in data science and machine learning – why don’t we overcome them together? Let me know if you’re into this and what kind of troubles, apart from data imbalance, you’d like to dive deeper into!

Cheers!

Resources

You can explore the full code available on my GitHub.
