How to handle Imbalanced Data in machine learning classification
 With an example in Python

Lianne & Justin


In this tutorial, you’ll learn about imbalanced data and how to handle it in machine learning classification with Python.

Imbalanced data occurs when the classes of the dataset are distributed unequally. It is common for machine learning classification prediction problems.

An extreme example is a dataset where 99.9% of the observations are class A (the majority class) and only 0.1% are class B (the minority class). If you feed such data directly into a machine learning algorithm, you may find that the model ‘ignores’ the minority class and predicts it poorly. This is frustrating, since the goal is often to predict the minority class.

So, it is critical to understand and handle the imbalanced problem. Throughout this practical tutorial, you’ll use a highly imbalanced data example, and learn:

  • What is imbalanced data in machine learning classification
  • How to train and evaluate prediction results
  • How to deal with it using 6 techniques:
    • Collecting a bigger sample
    • Oversampling (e.g., random, SMOTE)
    • Undersampling (e.g., random, K-Means, Tomek links)
    • Combining over and undersampling
    • Weighing classes differently
    • Changing algorithms
  • Lots more.
  • All in Python!

In the end, you should be ready to make better predictions based on your imbalanced data.

Let’s jump in!



What is imbalanced data in machine learning?

Given a dataset with known labels/classes, we can build a model to predict the class that a new observation belongs to. This is called a machine learning classification problem. Within it, we have imbalanced data when the number of observations differs substantially across classes.

For example, for a dataset of credit card transactions, there could be 99.9% of legitimate transactions and only 0.1% of fraud. This is a highly imbalanced dataset.

So what is the problem with imbalanced data in machine learning?

While a slight imbalance is usually not a problem, a highly imbalanced dataset can cause issues for our classification predictions. This is because most machine learning algorithms need enough examples of each class to learn its patterns. When a class has very little data, the algorithm often fails to predict that class correctly.

Back to the credit card fraud detection example. Since the fraudulent data is underrepresented, a machine learning algorithm often gives poor predictions for such classes. This is problematic since we want to detect fraudulent transactions and catch them.

Besides credit card fraud detection, other fields also tend to have highly imbalanced datasets. For example:

  • Claim prediction/fraud detection in insurance companies
  • Spam detection
  • Customer churn/conversion prediction

Since machine learning classification could be binary (2-class) or multi-class, the imbalanced data problem could be for both. This tutorial will focus on imbalanced data in machine learning for binary classes, but you could extend the concept to multi-class.

Evaluation metrics: accuracy pitfall

Before diving into our example, let’s discuss the evaluation metrics. This is a critical choice for an imbalanced dataset.

For classification problems, we often use accuracy as the evaluation metric. It is easy to calculate and intuitive:

Accuracy = # of correct predictions / # of total predictions

But it is misleading for highly imbalanced datasets. In the credit card fraud example, a model that always classifies new transactions as legitimate would reach an accuracy of 99.9% if 99.9% of the transactions in the dataset are legitimate.

What an ‘accurate’ model!

But, don’t forget that our goal is to detect fraud, so such a model is useless.
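To make the pitfall concrete, here is a minimal sketch with made-up labels (9,990 legitimate and 10 fraudulent transactions, an assumption for illustration) and a ‘model’ that always predicts the majority class. Recall, the share of actual fraud cases the model catches, is shown only to expose the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 9,990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = np.array([0] * 9990 + [1] * 10)

# A "model" that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of fraud cases caught

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The accuracy looks excellent, yet not a single fraudulent transaction is detected.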

So for an imbalanced dataset, we must look at a broader picture of the prediction results. We could use other evaluation metrics such as the Area Under the ROC Curve (AUC), the F-score, or the Precision-Recall curve.

Further learning: to learn about the common evaluation metrics, please check out 8 popular Evaluation Metrics for Machine Learning Models.

In this tutorial, we’ll use AUC as the evaluation metric.

It’s a single metric that’s easy to use. AUC reaches its highest value of 1 when the classifier ranks every positive observation above every negative one, while a value of 0.5 corresponds to random guessing.
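As a small illustration of how AUC behaves at its two reference points, the sketch below scores invented labels with scikit-learn’s roc_auc_score. The labels and scores are made up for the example.

```python
from sklearn.metrics import roc_auc_score

# Invented labels: four negatives followed by two positives.
y_true = [0, 0, 0, 0, 1, 1]

# Scores that rank every positive above every negative give an AUC of 1.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])

# A constant score carries no ranking information and gives an AUC of 0.5.
auc_constant = roc_auc_score(y_true, [0.5] * 6)

print(auc_perfect, auc_constant)
```

Note that roc_auc_score takes predicted scores or probabilities, not hard 0/1 predictions, which is why the tutorial uses the predicted probability of class 1.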

We’ll calculate the AUC using the original imbalanced dataset and compare it with the AUC using the rebalanced datasets. This gives you an idea of the potential improvement from applying the imbalanced data techniques. Please note that the improvement varies across datasets and machine learning algorithms.

Now, let’s get to our example of imbalanced data.

Example of an imbalanced dataset

In this section, we’ll look at our example of an imbalanced dataset. We’ll quickly get it ready for applying the imbalanced data techniques.

The dataset is about abalone. If you’ve never heard of abalone, they are marine snails. Our goal is to identify whether an abalone belongs to a specific class, class 19. So this is a binary classification problem with two outcomes: positive (class 19) or negative.

You can download the data here. It’s a small and straightforward dataset.

Loading data

First, let’s load and look at the dataset in Python.

Each record is one abalone. There are 4174 rows and 9 columns. The target in this dataset is Class, showing whether the abalone is positive or negative. Besides that, we have features about the abalone, including sex, different sizes, and weight measurements. Among the columns, only Sex and Class are categorical data, and the rest are numerical.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4174 entries, 0 to 4173
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4174 non-null   object 
 1   Length          4174 non-null   float64
 2   Diameter        4174 non-null   float64
 3   Height          4174 non-null   float64
 4   Whole_weight    4174 non-null   float64
 5   Shucked_weight  4174 non-null   float64
 6   Viscera_weight  4174 non-null   float64
 7   Shell_weight    4174 non-null   float64
 8   Class           4174 non-null   object 
dtypes: float64(7), object(2)
memory usage: 293.6+ KB
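The loading step might look like the sketch below. It assumes the download is a CSV file with a header row. To keep the sketch runnable without the file, it reads five rows from an in-memory string instead. Those rows are reconstructed from the converted table shown later in this tutorial, and the file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Five rows standing in for the downloaded file. With the real data you would
# pass its path to pd.read_csv instead, e.g. pd.read_csv("abalone19.csv")
# (a hypothetical file name).
sample_csv = io.StringIO(
    "Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Class\n"
    "M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,negative\n"
    "M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,negative\n"
    "F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,negative\n"
    "M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,negative\n"
    "I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,negative\n"
)

df = pd.read_csv(sample_csv)
df.info()          # column names, non-null counts, and dtypes
print(df.head())   # first rows of the data
```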

Further learning: if you are new to Python, please check out the below resources.
For Python basics, check out the FREE Python crash course.
To learn about Python for data analysis basics, check out the course Python for Data Analysis with projects.

Transforming categorical columns

We’ll use the most basic machine learning classification algorithm: logistic regression. Logistic regression works on numeric inputs, so the categorical columns should be converted to dummy variables first. We’ll convert the two categorical columns (Sex and Class) within the dataset before modeling.

Further learning:
To learn about the theory of logistic regression, please check out Logistic Regression for Machine Learning: complete Tutorial.
To learn how to apply logistic regression in Python, check out Logistic Regression Example in Python: Step-by-Step Guide.

First, let’s take a look at the categories of Sex and Class.

M    1526
I    1341
F    1307
Name: Sex, dtype: int64

negative    4142
positive      32
Name: Class, dtype: int64

So Class has two categories: negative and positive, while Sex has three categories: Male (M), Infant (I), Female (F).

We can use the below code to convert them:

  • map the values of Class to 0 (negative) and 1 (positive).
    This is simple since Class only has two categories.
  • use the get_dummies function to convert the Sex column into 2 dummy variables: Sex_I and Sex_M.
    Sex has three categories, so 2 dummies are enough to represent it. The get_dummies function generates one dummy for each category of Sex (Sex_F, Sex_I, Sex_M), and the drop_first=True argument then removes the redundant first-level dummy (Sex_F). You can read an example with detailed explanations here.
      Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight  Class  Sex_I  Sex_M
0      0.455     0.365   0.095        0.5140          0.2245          0.1010        0.1500      0      0      1
1      0.350     0.265   0.090        0.2255          0.0995          0.0485        0.0700      0      0      1
2      0.530     0.420   0.135        0.6770          0.2565          0.1415        0.2100      0      0      0
3      0.440     0.365   0.125        0.5160          0.2155          0.1140        0.1550      0      0      1
4      0.330     0.255   0.080        0.2050          0.0895          0.0395        0.0550      0      1      0
...      ...       ...     ...           ...             ...             ...           ...    ...    ...    ...
4169   0.560     0.430   0.155        0.8675          0.4000          0.1720        0.2290      0      0      1
4170   0.565     0.450   0.165        0.8870          0.3700          0.2390        0.2490      0      0      0
4171   0.590     0.440   0.135        0.9660          0.4390          0.2145        0.2605      0      0      1
4172   0.600     0.475   0.205        1.1760          0.5255          0.2875        0.3080      0      0      1
4173   0.625     0.485   0.150        1.0945          0.5310          0.2610        0.2960      0      0      0

4174 rows × 10 columns
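The conversion described above can be sketched as follows on a small stand-in frame. The five rows and the single positive label are invented for illustration. The dtype=int argument is an addition so that the dummies show up as 0/1 rather than True/False in recent pandas versions.

```python
import pandas as pd

# Small stand-in frame with illustrative values (not the full dataset).
df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Class": ["negative", "negative", "negative", "negative", "positive"],
})

# Map the two Class categories to 0 and 1.
df["Class"] = df["Class"].map({"negative": 0, "positive": 1})

# One dummy per Sex category, then drop the first level (Sex_F).
df = pd.get_dummies(df, columns=["Sex"], drop_first=True, dtype=int)

print(df)
```

A row with Sex_I = 0 and Sex_M = 0 is therefore a female abalone.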

This is great!

Now, if we look at the Class categories again, you can see that it only has two values of 0 and 1. The value of 1 is what we want to predict (class 19 abalone), and it is only 0.7667% of the dataset. This is certainly a highly imbalanced dataset!

0    0.992333
1    0.007667
Name: Class, dtype: float64
Figure: Imbalanced classes

Splitting training and test sets

One more step before we move on to the imbalanced data techniques and modeling. Let’s split the dataset into training (80%) and test sets (20%). We can use the train_test_split function from sklearn:

  • with the stratify argument based on the Class categories, so that both the training and test datasets have class proportions similar to the complete dataset. This is important for imbalanced data.
  • with random_state set to an integer, so that we get the same split each time we run the code.

We also store the names of the feature columns in the variable features.

Now we have two sets: df_train and df_test. We’ll use df_train for modeling, and df_test for evaluation.

0    3313
1      26
Name: Class, dtype: int64

0    829
1      6
Name: Class, dtype: int64
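A stratified split along these lines could look like the sketch below. It uses a stand-in frame with the same class counts as the tutorial data (4142 negatives, 32 positives) and one made-up feature. The random_state value is an arbitrary choice, not necessarily the one used for the results in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame: same class counts as the tutorial data, one invented feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Length": rng.uniform(0.1, 0.8, size=4174),
    "Class": [0] * 4142 + [1] * 32,
})

# 80/20 split that preserves the class proportions in both parts.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=888  # arbitrary seed
)

# Names of the feature columns (everything except the target).
features = df_train.drop(columns=["Class"]).columns

print(df_train["Class"].value_counts())
print(df_test["Class"].value_counts())
```

Without stratify, a random split could easily leave the test set with too few positive cases to evaluate on.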

Finally, we are ready to try out some techniques to handle imbalanced data in machine learning!

1. Collecting a bigger sample (if possible)

This is a helpful technique that often gets overlooked. Expanding the sample helps us to get more information about the minority class.

Yet, this technique is not always feasible.

For our example data, there’s no way for us to collect measurements of more abalones.

But suppose you work at a credit card company. It is usually easy to expand the sample from one month of transactional data to two or three months. Over a longer period, you will likely find more fraudulent transactions. Even though the data is still imbalanced, you have more observations of the minority class to use with the other techniques below.

2. Oversampling

When the data has imbalanced categories, a natural thought is to balance it. We could either increase the number of the minority class or decrease the number of the majority class. This could be done by resampling.

Let’s start with the more popular oversampling, which adds examples of the minority class to balance the dataset. Most of the time, we increase the number of minority class examples to match the majority class, reaching ‘balance’. This balances the impact of the classes on the machine learning algorithm.

Figure: Oversampling

There are different methods of oversampling. We’ll cover a few popular ones below:

  • Simple random oversampling: the basic approach of random sampling with replacement from the minority class.
  • Oversampling with shrinkage: based on random sampling, adding some noise/shrinkage to disperse the new samples.
  • Oversampling using SMOTE: synthesize new samples based on the minority class.

Let’s apply each of these oversampling techniques to our example dataset.

Simple random oversampling

We’ll begin with simple random oversampling. This is a straightforward approach. We simply take copies/samples with replacement from the minority class, until the minority class has the same number of examples as the majority class.

We’ll use two ways to achieve this in Python. One uses the pandas library, which is more transparent so that you can understand the process. The other uses the imbalanced-learn library, which is less code to implement. Both produce the same results. So you can choose either of them.

pandas

With the code below, we generate a new balanced training dataset:

  1. Calculate num_to_oversample: how many extra copies of the minority class do we need to balance the data
  2. Generate samples from the minority class training data, with replacement
  3. Concatenate the new sample and the original training dataset

In the end, we have the new training dataset with the two classes balanced: both with 3313 observations.

0    3313
1    3313
Name: Class, dtype: int64
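The three steps can be sketched with pandas as follows. The stand-in training frame (95 majority rows, 5 minority rows) and the random_state value are assumptions for illustration.

```python
import pandas as pd

# Stand-in training frame: 95 majority rows (0) and 5 minority rows (1).
df_train = pd.DataFrame({"x": range(100), "Class": [0] * 95 + [1] * 5})

# 1. How many extra minority rows are needed to match the majority class.
counts = df_train["Class"].value_counts()
num_to_oversample = counts.loc[0] - counts.loc[1]

# 2. Draw that many rows from the minority class, with replacement.
minority = df_train[df_train["Class"] == 1]
extra = minority.sample(n=num_to_oversample, replace=True, random_state=888)

# 3. Append the new rows to the original training data.
df_train_oversample = pd.concat([df_train, extra], ignore_index=True)

print(df_train_oversample["Class"].value_counts())
```

Only the training set is resampled. The test set should keep its original class distribution so that the evaluation reflects real data.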

Next, we can apply the logistic regression algorithm to the new balanced dataset df_train_oversample. Again, if you are not familiar with using Python for logistic regression, you can check out Logistic Regression Example in Python: Step-by-Step Guide. But here are the basic steps:

  1. Instantiate a LogisticRegression class
  2. Fit using df_train_oversample
  3. Generate the predicted probability of the target class 1 (y_pred)
  4. Calculate the AUC metric
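These four steps might be written as in the sketch below. It runs on synthetic data from scikit-learn’s make_classification as a stand-in for the abalone training and test sets, so the resulting AUC is not comparable to the numbers in this tutorial. The max_iter setting is an assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training and test sets.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)    # 1. instantiate the model
clf.fit(X_train, y_train)                  # 2. fit on the (rebalanced) training set
y_pred = clf.predict_proba(X_test)[:, 1]   # 3. predicted probability of class 1
auc = roc_auc_score(y_test, y_pred)        # 4. AUC on the untouched test set

print(auc)
```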

The AUC for this model is 0.838962605548854. This is a decent result, but how does it compare with using the original dataset? Keep reading, and you’ll find out at the end of the tutorial.

imbalanced-learn

Besides pandas, we could also use the library imbalanced-learn to random oversample.

Below we apply RandomOverSampler to fit_resample the training dataset. This gives us X_resampled and y_resampled, the features and the target of the new balanced training set. As you can see, y_resampled has two classes of 0 and 1, being balanced as well.

0    3313
1    3313
Name: Class, dtype: int64

Please note that we’ve set random_state as the same integer as the pandas example, so we generate the same balanced training set. But as you can see, imblearn needs less code to do it.

Then, we can apply logistic regression the same way and calculate the AUC metric. It gives the same AUC of 0.838962605548854 as the pandas method, since we used the same random_state again.

Oversampling with shrinkage

Now that we are done with basic random sampling, let’s add some noise to the sample. We can do this with the shrinkage parameter in imblearn.

The below code is similar to the previous random sampling example, except for the extra shrinkage=0.1 argument.

shrinkage can take values greater than or equal to 0. When shrinkage=0, it is the same as simple random oversampling. The larger the shrinkage value, the more noise we add, so the more dispersed the new samples will be. This is useful when we don’t want to repeat exact copies of the samples.

By setting it to 0.1 below, we make the sampled observations slightly different from the original ones. You can read more about the shrinkage factor here.

This results in a different sample, but still, a balanced one with both classes having the same number of observations.

0    3313
1    3313
Name: Class, dtype: int64

We can again apply the logistic regression algorithm and check its AUC.

This gives an AUC of 0.8059911540008041. Changing the shrinkage value might improve the result. Please feel free to tune it as you need.

Oversampling using SMOTE

The last oversampling technique we’ll cover is SMOTE (Synthetic Minority Over-sampling TEchnique). It is a more sophisticated technique than the previous ones. Random sampling is easy, but the new samples don’t add more information to the machine learning algorithms. SMOTE improves on that.

SMOTE oversamples the minority class by creating ‘synthetic’ examples rather than copies. It involves some methods, including nearest neighbors, to generate plausible new examples. You can read more about it in its original paper.
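To give a feel for the interpolation idea, here is a simplified sketch that builds new points on the line segments between minority samples and their nearest minority neighbors. It is not the imbalanced-learn implementation. The function name smote_like_samples, the random minority data, and the parameter choices are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: with no duplicate rows, each point comes back
    # as its own nearest neighbor in column 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)    # random base points
    pick = rng.integers(1, k + 1, size=n_new)              # one of the k neighbors
    neighbor = neighbor_idx[base, pick]
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])


# Invented minority class: 26 points with 3 features.
X_minority = np.random.default_rng(1).normal(size=(26, 3))
X_new = smote_like_samples(X_minority, n_new=100)
print(X_new.shape)
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.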

We can apply SMOTE oversampling through the imblearn library. The process is similar to random oversampling with replacement, but we use the SMOTE class to resample.

Great! The SMOTE oversampling also generates a balanced dataset.

0    3313
1    3313
Name: Class, dtype: int64

As before, we’ll apply logistic regression on the balanced dataset and calculate its AUC.

The AUC is 0.7913148371531966.

That’s all for the oversampling techniques, and we’ll move on to undersampling.

3. Undersampling

As the name suggests, undersampling downsizes the majority class to balance it with the minority class.

Figure: Undersampling

There are also many methods of undersampling. We’ll cover the below popular ones:

  • Simple random undersampling: the basic approach of random sampling from the majority class.
  • Undersampling using K-Means: synthesize based on the cluster centroids.
  • Undersampling using Tomek links: detects and removes samples from Tomek links.

Let’s apply each of these undersampling techniques to our example dataset.

Simple random undersampling

We’ll begin with simple random undersampling. We take a random sample from the majority class that has the same size as the minority class. Since the rest of the majority class is discarded, there is a risk of removing useful information from the dataset.

We’ll use two ways to achieve this in Python. One uses the pandas library, which is more transparent so that you can understand the process. The other uses the imbalanced-learn library, which is less code to implement. Both can produce the same results. So you can choose either of them.

pandas

I won’t go through the process again since it’s similar to oversampling with pandas. But please note that the sampling process is without replacement in this example. This results in a new balanced training set: both classes have 26 observations.

0    26
1    26
Name: Class, dtype: int64
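A pandas version of random undersampling might look like this sketch. The stand-in training frame and the random_state value are assumptions for illustration.

```python
import pandas as pd

# Stand-in training frame: 95 majority rows (0) and 5 minority rows (1).
df_train = pd.DataFrame({"x": range(100), "Class": [0] * 95 + [1] * 5})

minority = df_train[df_train["Class"] == 1]
majority = df_train[df_train["Class"] == 0]

# Sample the majority class down to the minority size, without replacement.
majority_sample = majority.sample(n=len(minority), replace=False, random_state=888)

df_train_undersample = pd.concat([majority_sample, minority], ignore_index=True)
print(df_train_undersample["Class"].value_counts())
```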

We also apply the logistic regression and calculate AUC.

The AUC is 0.6465621230398071. This is much lower than the oversampling techniques. This is because the minority class has a small number of samples. We removed a lot of information when undersampling.

imbalanced-learn

The process of imblearn is also similar, but we use the RandomUnderSampler class instead.

0    26
1    26
Name: Class, dtype: int64

And this produces the same AUC as pandas undersampling, since we use the same random_state.

Undersampling using K-Means

Besides random sampling, we could also use the cluster centroids from the K-Means method as the new sample of the majority class. This means the new sample of the majority class no longer consists of original data points. It is synthesized from cluster centroids. So the new sample should be more representative of the actual majority class data than a random subset. Please read more about it here.

We could, again, use the imblearn library. There’s a ClusterCentroids class. The below code undersamples the majority class to be the same as the minority class, using the cluster centroids.

0    26
1    26
Name: Class, dtype: int64
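To show the idea behind centroid-based undersampling, the sketch below applies scikit-learn’s KMeans directly to an invented majority class and keeps one centroid per minority sample. It illustrates the concept and is not the ClusterCentroids code itself. The data, the n_init value, and the random_state are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented two-feature data: 500 majority points and 20 minority points.
rng = np.random.default_rng(0)
X_majority = rng.normal(loc=0.0, size=(500, 2))
X_minority = rng.normal(loc=3.0, size=(20, 2))

# Cluster the majority class into as many clusters as there are minority points.
kmeans = KMeans(n_clusters=len(X_minority), n_init=10, random_state=888)
kmeans.fit(X_majority)

# The centroids replace the original majority observations.
X_majority_new = kmeans.cluster_centers_

X_resampled = np.vstack([X_majority_new, X_minority])
y_resampled = np.array([0] * len(X_majority_new) + [1] * len(X_minority))
print(X_resampled.shape, np.bincount(y_resampled))
```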

After applying the logistic regression on the new balanced dataset, we get an AUC of 0.6377161238439888. Again, the value is low due to the loss of information.

Undersampling using Tomek links

Lastly, let’s look at the Tomek links undersampling approach. This method detects Tomek links and removes samples based on them.

What is a Tomek link?

A Tomek link is a pair of samples from different classes that are each other’s nearest neighbors.

In our binary classification example, a Tomek link is a pair with one sample from each class, where no other sample in the dataset is closer to either of them. After detecting such a link, we can remove data within the pair. Usually, we remove the majority class sample to achieve undersampling, i.e., we remove majority class samples that sit close to the minority class. This reduces the ambiguity between the two classes.
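The definition can be checked by hand with a short sketch. The helper tomek_link_pairs below is hypothetical and written for illustration, not taken from imblearn. It assumes there are no duplicate rows and that class 0 is the majority class. The five one-dimensional points are invented.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def tomek_link_pairs(X, y):
    """Return index pairs that are mutual nearest neighbors with different labels."""
    # Column 0 is the point itself (distance 0), column 1 its nearest neighbor.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    pairs = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs


# Invented data: points 2 and 3 are mutual nearest neighbors from different classes.
X = np.array([[0.0], [0.9], [2.0], [2.1], [5.0]])
y = np.array([0, 0, 0, 1, 1])

pairs = tomek_link_pairs(X, y)

# Undersampling: drop the majority class (0) member of each link.
majority_in_links = [i if y[i] == 0 else j for i, j in pairs]
keep = np.setdiff1d(np.arange(len(X)), majority_in_links)
X_clean, y_clean = X[keep], y[keep]
print(pairs, y_clean)
```

Points 0 and 1 are also mutual nearest neighbors, but they share a class, so they do not form a Tomek link.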

Figure: Tomek links undersampling

So, undersampling with Tomek links cleans up the overlaps between classes, making them easier to distinguish.

We could use imblearn‘s TomekLinks class to do this. Note that there’s no randomness with Tomek links, so we don’t have the random_state parameter.

By default, Tomek links undersampling only removes the majority class samples that form Tomek links with minority class samples. So the data remains imbalanced. In practice, we could combine the Tomek links approach with other techniques. We’ll give an example in the next section.

0    3298
1      26
Name: Class, dtype: int64

With Tomek links undersampling method by itself, the AUC of the logistic regression is 0.683956574185766.

That’s it for undersampling.

Next, we’ll look at the hybrid approach.

4. Combining Oversampling and Undersampling

Each of the oversampling and undersampling techniques has its pros and cons. Sometimes it is good to mix and combine their strengths. This section will look at an example of oversampling using SMOTE and undersampling using Tomek links.

SMOTE and Tomek links

The SMOTE oversampling approach could generate noisy samples since it creates synthetic data. To solve this problem, after SMOTE, we could use undersampling techniques to clean up. We’ll use the Tomek links undersampling technique in this example.

Within the imblearn library, there’s the SMOTETomek class that can help.

This results in a balanced dataset with each class of size 3309.

0    3309
1    3309
Name: Class, dtype: int64

Let’s go through the details.

As you may recall, the original training set has class 0 of size 3313 and class 1 of size 26. The SMOTETomek approach first oversampled with SMOTE, which results in a sample with both classes of size 3313. Then the Tomek links technique kicked in and cleaned up the ‘links’, resulting in fewer samples in both classes.

In this SMOTETomek technique, the pair of samples from both the majority and minority classes that form a Tomek link is removed. Therefore, this results in a dataset with two classes of the same size.

The AUC of the logistic regression of the SMOTETomek technique is 0.7913148371531966.

There are also other ways to combine the techniques. You can be creative!

5. Weighing classes differently

Besides resampling data, we can also balance the classes by weighing the data differently. As you know, we usually consider each observation equally, with a weight value of 1. But for imbalanced datasets, we can balance the classes by putting more weight on the minority classes.

For example, suppose we want the overall weights of the minority and majority classes to be equal. In that case, we can use the compute_class_weight function from scikit-learn. The below code estimates weights for our imbalanced training dataset.

The variable weights is assigned as an array as below.

array([ 0.50392394, 64.21153846])
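The weights above can be reproduced with the sketch below. Instead of the actual training frame, it builds a label array with the same class counts as the training set (3313 negatives, 26 positives).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label array with the same class counts as the training set.
y_train = np.array([0] * 3313 + [1] * 26)

# 'balanced' weights make the total weight of each class equal.
weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_train), y=y_train
)
print(weights)
```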

This means that if we want the dataset to be balanced, we need to weigh the majority class at 0.50392394 and the minority class at 64.21153846. So a much higher weight for the minority class.

Let’s verify that these weights can indeed balance the dataset.

Using the below code, we multiply the counts of each class by their respective weights.

Both of them give the same number of 1669.5. So by applying these weights, the majority and minority classes would be equally weighted.

1669.5
1669.5000000000002

If we sum up the weighted counts of both classes, the total equals the number of observations, which is what we would get by weighing every observation by 1.

3339.0
3339
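The check can be written out by hand, as in the sketch below. It uses the formula scikit-learn documents for balanced weights, n_samples / (n_classes * class_count), with the class counts from the training set.

```python
# Class counts from the training set.
n_majority, n_minority = 3313, 26
n_samples = n_majority + n_minority
n_classes = 2

# Balanced weight of a class: n_samples / (n_classes * class_count).
w_majority = n_samples / (n_classes * n_majority)
w_minority = n_samples / (n_classes * n_minority)

# Weighted size of each class (equal, up to floating-point rounding).
weighted_majority = n_majority * w_majority
weighted_minority = n_minority * w_minority

# The weighted total matches the plain number of observations.
weighted_total = weighted_majority + weighted_minority
print(weighted_majority, weighted_minority, weighted_total, n_samples)
```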

All right! So now you’ve got the idea of how to weigh classes differently. What does this mean for a machine learning algorithm like logistic regression?

The different weights make it cost more to misclassify a minority class observation than a majority class one. This supports our goal of classifying the minority class.

In the LogisticRegression class within sklearn, we can apply different weights to balance the data with the parameter class_weight. So we don’t need to go through the calculation above.

We can use the code below to apply logistic regression to the differently weighted datasets, with the extra argument class_weight='balanced'. The rest of the process is the same.

The AUC of this technique is 0.8275030156815439.

Besides changing the weights of the two classes to balance them, we can also specify custom weights of positive and negative classes. For example, the below code weighs class 1 by 100 times more than class 0.

This returns an AUC of 0.8375552874949739.
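Both weighting options described above can be sketched as follows. The code runs on synthetic data from make_classification, so its AUC values are not comparable to the ones reported for the abalone data. The max_iter setting is an assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training and test sets.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 'balanced' lets sklearn compute the weights; the dict sets them by hand.
options = {"balanced": "balanced", "custom": {0: 1, 1: 100}}

aucs = {}
for name, class_weight in options.items():
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)  # no resampling, only reweighting
    aucs[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(aucs)
```

Unlike resampling, this approach leaves the training data unchanged and only alters how much each misclassification costs during fitting.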

6. Changing algorithms

We’ve been using the logistic regression algorithm so far. There are other machine learning algorithms that are more tolerant of imbalanced data, for example, decision tree-based models.

Further learning:
Decision Tree Model in Machine Learning: Practical Tutorial with Python
Unlocking Random Forest in Machine Learning

We won’t show examples of those algorithms. If you are interested, please take a look at this paper. This survey reviewed and compared different imbalanced data sets, machine learning algorithms, and balancing techniques.

They found that simple linear algorithms like logistic regression benefitted more from the balancing techniques, while for more complicated models such as random forest and XGBoost, the results were mixed. So, in conclusion, they recommended balancing data for linear models.

For example, for our abalone dataset, the logistic regression applied to the original dataset gives an AUC of 0.683956574185766. This is low compared to many of the balancing techniques we’ve tried. So the balancing techniques did help!

Which technique to choose?

You’ve learned popular techniques for handling imbalanced data sets in machine learning. Which one is the best?

This is a tricky question.

There’s no rule of thumb. So you need to try different techniques and compare their performance with the original dataset. But here are a couple of tips:

  • When the minority class is too small, like our example dataset, undersampling alone is not good.
  • When the overall dataset is too large, it’s better to undersample, and then perhaps oversample.

Besides the techniques mentioned in this tutorial, there are also other ones. But these should be a good starting point for you to explore.

Try them out now!


In this post, you’ve learned about popular techniques to handle imbalanced data in machine learning classification.

Hope you now know where to start dealing with your imbalanced data.

We’d love to hear from you. Leave a comment for any questions you may have or anything else.
