In this tutorial, you’ll learn about **imbalanced data** and how to handle it **in machine learning classification** **in Python**.

Imbalanced data occurs when the classes of a dataset are distributed unequally. It is common in machine learning classification problems.

An extreme example is a dataset where 99.9% of the observations are class A (the majority class) and only 0.1% are class B (the minority class). If you feed such data directly into a machine learning algorithm, you may find that the model ‘ignores’ the minority class and predicts it poorly. This is frustrating since the goal is often to predict the minority class.

So, it is critical to understand and handle the imbalance problem. Throughout this practical tutorial, you’ll use a highly imbalanced dataset as an example, and learn:

- What is imbalanced data in machine learning classification
- How to train and evaluate prediction results
- How to deal with it using 6 techniques:
  - Collecting a bigger sample
  - Oversampling (e.g., random, SMOTE)
  - Undersampling (e.g., random, K-Means, Tomek links)
  - Combining over and undersampling
  - Weighing classes differently
  - Changing algorithms
- Lots more.
- All in Python!

In the end, you should be ready to make better predictions based on your imbalanced data.

Let’s jump in!

- What is imbalanced data in machine learning?
- Evaluation metrics: accuracy pitfall
- Example of an imbalanced dataset
- 1. Collecting a bigger sample (if possible)
- 2. Oversampling
- 3. Undersampling
- 4. Combining Oversampling and Undersampling
- 5. Weighing classes differently
- 6. Changing algorithms
- Which technique to choose?

## What is imbalanced data in machine learning?

Given a dataset with known labels/classes, we can build a model to predict the class a new observation belongs to. This is called the machine learning **classification** problem. Within it, we have **imbalanced data** when the number of observations across classes is not equal or close to equal.

For example, for a dataset of credit card transactions, there could be 99.9% of legitimate transactions and only 0.1% of fraud. This is a highly imbalanced dataset.

So what is the problem with imbalanced data in machine learning?

While a slight imbalance wouldn’t be a problem, a highly imbalanced dataset could cause issues for our classification predictions. This is because most machine learning algorithms rely on sufficient data. When a class has little data, the algorithm struggles to predict it correctly.

Back to the credit card fraud detection example. Since the fraudulent data is underrepresented, a machine learning algorithm often gives poor predictions for that class. This is problematic since we want to detect fraudulent transactions and catch them.

Besides credit card fraud detection, other fields also tend to have highly imbalanced datasets. For example:

- Claim prediction/fraud detection in insurance companies
- Spam detection
- Customer churn/conversion prediction

Since machine learning classification can be binary (2-class) or multi-class, the imbalanced data problem can arise in both. This tutorial focuses on imbalanced data for binary classification, but you can extend the concepts to multi-class problems.

## Evaluation metrics: accuracy pitfall

Before diving into our example, let’s discuss the evaluation metrics. This is a critical choice for an imbalanced dataset.

For classification problems, we often use accuracy as the evaluation metric. It is easy to calculate and intuitive:

Accuracy = # of correct predictions / # of total predictions

But it is misleading for highly imbalanced datasets. In the credit card fraud detection example, we could build a model that always classifies new transactions as legitimate. Its accuracy would be 99.9% if 99.9% of the transactions in the dataset are legitimate.

What an ‘accurate’ model!

But, don’t forget that our goal is to detect fraud, so such a model is useless.
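To make the pitfall concrete, here is a minimal sketch with made-up labels: 1,000 transactions, one of which is fraud, and a ‘model’ that always predicts the majority class. The numbers are purely illustrative and are not taken from the tutorial’s dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 999 legitimate transactions (0) and 1 fraudulent one (1).
y_true = np.array([0] * 999 + [1])

# A "model" that always predicts the majority class (legitimate).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # share of actual fraud cases that were caught

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The accuracy looks excellent, yet the recall shows that not a single fraud case is detected.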

So for an imbalanced dataset, we must look at a broader picture of the prediction results. We could use other evaluation metrics such as Area Under the ROC Curve (AUC), F-score, and the Precision-Recall Curve.

**Further learning**: to learn about the common evaluation metrics, please check out 8 popular Evaluation Metrics for Machine Learning Models.

In this tutorial, we’ll use **AUC** as the evaluation metric.

It’s a single metric that’s easy to use. AUC reaches its highest value of 1 when the classifier ranks every positive example above every negative one, while a value of 0.5 is no better than random guessing.
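As a quick illustration of the AUC scale, the sketch below scores four made-up observations with `roc_auc_score` from `scikit-learn`. The labels and scores are invented for this example only.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only.
y_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]   # every positive is scored above every negative
constant_scores = [0.5, 0.5, 0.5, 0.5]  # the scores carry no information

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_constant = roc_auc_score(y_true, constant_scores)

print(auc_perfect, auc_constant)
```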

We’ll calculate the AUC of a model trained on the original imbalanced dataset, and compare it with models trained on the rebalanced datasets. This gives you an idea of the potential improvement from applying the imbalanced data techniques. Yet, please note that the improvement varies across datasets and machine learning algorithms.

Now, let’s get to our example of imbalanced data.

## Example of an imbalanced dataset

In this section, we’ll look at our example of an imbalanced dataset. We’ll quickly get it ready for applying the imbalanced data techniques.

The dataset is about abalone. If you’ve never heard of abalone, it is a type of marine snail. Our goal is to identify whether an abalone belongs to a specific class, 19. So this is a binary classification problem of either positive (class 19) or negative.

You can download the data here. It’s a small and straightforward dataset.

### Loading data

First, let’s load and look at the dataset in Python.

Each record is one abalone. There are 4174 rows and 9 columns. The target in this dataset is `Class`, showing whether the abalone is positive or negative. Besides that, we have features about the abalone, including sex, different sizes, and weight measurements. Among the columns, only `Sex` and `Class` are categorical data, and the rest are numerical.
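A minimal loading sketch could look like the one below. The file name `abalone19.csv` is a hypothetical placeholder for wherever you saved the download, and the CSV layout is an assumption based on the column list in this section. So that the snippet runs without the file, it reads three rows rebuilt from the table shown later in this tutorial.

```python
import io

import pandas as pd

# Hypothetical file name (assumption): point this at your downloaded copy.
DATA_PATH = "abalone19.csv"

# Three rows rebuilt from the table in this tutorial, so the sketch runs without the file.
sample_csv = io.StringIO(
    "Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Class\n"
    "M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,negative\n"
    "M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,negative\n"
    "F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,negative\n"
)

# With the real file, this would be: df = pd.read_csv(DATA_PATH)
df = pd.read_csv(sample_csv)

# Summarize the columns, their non-null counts, and their dtypes.
df.info()
```

With the full file, the same `info` call summarizes all 4174 rows.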

`<class 'pandas.core.frame.DataFrame'>`, RangeIndex: 4174 entries, 0 to 4173. Data columns (total 9 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Sex | 4174 non-null | object |
| 1 | Length | 4174 non-null | float64 |
| 2 | Diameter | 4174 non-null | float64 |
| 3 | Height | 4174 non-null | float64 |
| 4 | Whole_weight | 4174 non-null | float64 |
| 5 | Shucked_weight | 4174 non-null | float64 |
| 6 | Viscera_weight | 4174 non-null | float64 |
| 7 | Shell_weight | 4174 non-null | float64 |
| 8 | Class | 4174 non-null | object |

dtypes: float64(7), object(2). Memory usage: 293.6+ KB

**Further learning**: if you are new to Python, please check out the below resources.

For **Python basics**, check out the FREE Python crash course.

To learn about **Python for data analysis basics**, check out the course Python for Data Analysis with projects.

### Transforming categorical columns

We’ll use the most basic machine learning classification algorithm: logistic regression. For logistic regression, it is better to convert all the categorical columns to dummy variables. So we’ll convert the two categorical columns (`Sex` and `Class`) within the dataset before modeling.

**Further learning**:

To learn about the **theory of logistic regression**, please check out Logistic Regression for Machine Learning: complete Tutorial.

To learn how to **apply logistic regression in Python**, check out Logistic Regression Example in Python: Step-by-Step Guide.

First, let’s take a look at the categories of `Sex` and `Class`.

| Sex | Count |
|---|---|
| M | 1526 |
| I | 1341 |
| F | 1307 |

| Class | Count |
|---|---|
| negative | 4142 |
| positive | 32 |

So `Class` has two categories: negative and positive, while `Sex` has three categories: Male (M), Infant (I), Female (F).

We can use the below code to convert them:

- `map` the values of `Class` to 0 (negative) and 1 (positive).
  This is simple since `Class` only has two categories.
- Use the `get_dummies` function to convert the `Sex` column into 2 dummy variables, `Sex_I` and `Sex_M`.
  This is because `Sex` has three categories, so we need 2 dummies to represent it. The `get_dummies` function generates 3 dummies, one for each category of `Sex` (`Sex_F`, `Sex_I`, `Sex_M`), then the `drop_first=True` argument removes the unnecessary first-level dummy variable. You can read an example with detailed explanations here.
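The two steps above can be sketched as follows. The five rows are toy values invented for illustration (the variable name `df` is also an assumption), so only the column handling mirrors the tutorial.

```python
import pandas as pd

# Toy rows for illustration only; the real dataset has 4174 rows and more columns.
df = pd.DataFrame(
    {
        "Sex": ["M", "M", "F", "M", "I"],
        "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
        "Class": ["negative", "negative", "negative", "negative", "positive"],
    }
)

# Step 1: map the two target categories to 0 and 1.
df["Class"] = df["Class"].map({"negative": 0, "positive": 1})

# Step 2: one dummy per category of Sex, then drop the first level (Sex_F).
df = pd.get_dummies(df, columns=["Sex"], drop_first=True)

print(df)
```

Dropping `Sex_F` loses no information: a row with both `Sex_I` and `Sex_M` equal to 0 (or `False`, depending on your `pandas` version) must be female.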

| | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Class | Sex_I | Sex_M |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 0 | 0 | 1 |
| 1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 0 | 0 | 1 |
| 2 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 0 | 0 | 0 |
| 3 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 0 | 0 | 1 |
| 4 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 0 | 1 | 0 |
| … | … | … | … | … | … | … | … | … | … | … |
| 4169 | 0.560 | 0.430 | 0.155 | 0.8675 | 0.4000 | 0.1720 | 0.2290 | 0 | 0 | 1 |
| 4170 | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 0 | 0 | 0 |
| 4171 | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 0 | 0 | 1 |
| 4172 | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 0 | 0 | 1 |
| 4173 | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 0 | 0 | 0 |

4174 rows × 10 columns

This is great!

Now, if we look at the `Class` categories again, you can see that it only has two values, 0 and 1. The value of 1 is what we want to predict (class 19 abalone), and it is only 0.7667% of the dataset. This is certainly a highly imbalanced dataset!

| Class | Proportion |
|---|---|
| 0 | 0.992333 |
| 1 | 0.007667 |

### Splitting training and test sets

One more step before we move on to the imbalanced data techniques and modeling. Let’s split the dataset into training (80%) and test (20%) sets. We can use the `train_test_split` function from `sklearn`:

- with the `stratify` argument based on the `Class` categories, so that both the training and test datasets have similar proportions of classes as the complete dataset. This is important for imbalanced data.
- with `random_state` set to an integer, so that we get the same result on each run.

Then, we also store the column names of the features in the variable `features`.

Now we have two sets: `df_train` and `df_test`. We’ll use `df_train` for modeling, and `df_test` for evaluation.
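Here is a minimal sketch of such a stratified split. It runs on synthetic data (random features `f1` to `f3`, an assumption) that only copies the class counts of the abalone dataset, 4142 negatives and 32 positives. The `random_state` value of 10 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): random features with the tutorial's class counts.
rng = np.random.default_rng(10)
df = pd.DataFrame(rng.normal(size=(4174, 3)), columns=["f1", "f2", "f3"])
df["Class"] = [0] * 4142 + [1] * 32

# 80/20 split that keeps the class proportions similar in both sets.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=10
)

# Column names of the features, i.e. everything except the target.
features = df_train.drop(columns=["Class"]).columns

print(df_train["Class"].value_counts())
print(df_test["Class"].value_counts())
```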

| Set | Class 0 | Class 1 |
|---|---|---|
| `df_train` | 3313 | 26 |
| `df_test` | 829 | 6 |

Finally, we are ready to try out some techniques to handle imbalanced data in machine learning!

## 1. Collecting a bigger sample (if possible)

This is a helpful technique that often gets overlooked. Expanding the sample helps us to get more information about the minority class.

Yet, this technique is not always feasible.

For our example data, there’s no way for us to collect measurements of more abalones.

But suppose you work at a credit card company. It is usually easy to expand the sample from one month of transactional data to two or three months. Within a longer period, you may find more fraudulent transactions. Even though the data is still imbalanced, you have more observations of the minority class to apply the other techniques below.

## 2. Oversampling

When the data has imbalanced categories, a natural thought is to balance it. We could either increase the number of the minority class or decrease the number of the majority class. This could be done by resampling.

Let’s start with the more popular **oversampling**, which adds examples of the minority class to balance the dataset. Most of the time, we would increase the number of minority class examples to be the same as the majority class, reaching the ‘balance’. This balances the impact of the different classes in the machine learning algorithms.

There are different methods of oversampling. We’ll cover a few popular ones below:

- **Simple random oversampling**: the basic approach of random sampling with replacement from the minority class.
- **Oversampling with shrinkage**: based on random sampling, adding some noise/shrinkage to disperse the new samples.
- **Oversampling using SMOTE**: synthesize new samples based on the minority class.

Let’s apply each of these oversampling techniques to our example dataset.

### Simple random oversampling

We’ll begin with simple **random oversampling**. This is a straightforward approach. We simply take copies/samples with replacement from the minority class, until the minority class has the same number of examples as the majority class.

We’ll use two ways to achieve this in Python. One uses the `pandas` library, which is more transparent so that you can understand the process. The other uses the `imbalanced-learn` library, which is less code to implement. Both produce the same results. So you can choose either of them.

#### pandas

With the code below, we generate a new balanced training dataset:

- Calculate `num_to_oversample`: how many extra copies of the minority class we need to balance the data
- Generate samples from the minority class training data, with replacement
- Concatenate the new sample and the original training dataset
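The three steps above can be sketched as follows. This is not the tutorial’s original code: it runs on a synthetic training set (two made-up features, `f1` and `f2`) that copies the class counts of `df_train`, and the `random_state` value is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train (assumption): 3313 negatives and 26 positives.
rng = np.random.default_rng(10)
df_train = pd.DataFrame(rng.normal(size=(3339, 2)), columns=["f1", "f2"])
df_train["Class"] = [0] * 3313 + [1] * 26

# Step 1: how many extra minority rows are needed to match the majority class.
counts = df_train["Class"].value_counts()
num_to_oversample = int(counts.loc[0] - counts.loc[1])

# Step 2: sample the minority rows with replacement.
minority = df_train[df_train["Class"] == 1]
extra = minority.sample(n=num_to_oversample, replace=True, random_state=10)

# Step 3: append the sampled rows to the original training set.
df_train_oversample = pd.concat([df_train, extra], ignore_index=True)

print(df_train_oversample["Class"].value_counts())
```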

In the end, we have the new training dataset with the two classes balanced: both with 3313 observations.

| Class | Count |
|---|---|
| 0 | 3313 |
| 1 | 3313 |

Next, we can apply the logistic regression algorithm to the new balanced dataset `df_train_oversample`. Again, if you are not familiar with using Python for logistic regression, you can check out Logistic Regression Example in Python: Step-by-Step Guide. But here are the basic steps:

- Instantiate a `LogisticRegression` class
- Fit using `df_train_oversample`
- Generate the predicted probability for the target class of 1, `y_pred`
- Calculate the AUC metric
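A sketch of these four steps is below. The data is synthetic (two made-up features in which the positive class is shifted by 2), so the AUC it prints will not match the figures quoted in this tutorial. The helper `make_frame` is hypothetical and exists only to create the stand-in data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)

def make_frame(n_neg, n_pos):
    """Hypothetical helper: synthetic rows with positives shifted away from negatives."""
    values = np.vstack(
        [rng.normal(0.0, 1.0, (n_neg, 2)), rng.normal(2.0, 1.0, (n_pos, 2))]
    )
    frame = pd.DataFrame(values, columns=["f1", "f2"])
    frame["Class"] = [0] * n_neg + [1] * n_pos
    return frame

df_train_oversample = make_frame(3313, 3313)  # stands in for the balanced training set
df_test = make_frame(829, 6)                  # the test set stays imbalanced
features = ["f1", "f2"]

# Steps 1 and 2: instantiate and fit on the balanced training data.
clf = LogisticRegression()
clf.fit(df_train_oversample[features], df_train_oversample["Class"])

# Step 3: predicted probability of class 1 for each test row.
y_pred = clf.predict_proba(df_test[features])[:, 1]

# Step 4: AUC on the test set.
auc = roc_auc_score(df_test["Class"], y_pred)
print(auc)
```

Note that only the training data is rebalanced. The test set keeps its original class proportions, so the metric reflects the data the model would see in practice.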

The AUC for this model is 0.838962605548854. This is a decent result, but how does it compare to using the original dataset? Keep reading, and you’ll find out at the end of the tutorial.

#### imbalanced-learn

Besides `pandas`, we could also use the library `imbalanced-learn` for random oversampling.

Below we apply `RandomOverSampler` to `fit_resample` the training dataset. This gives us `X_resampled` and `y_resampled`, the features and the target of the new balanced training set. As you can see, `y_resampled` has two classes, 0 and 1, and they are balanced as well.

| Class | Count |
|---|---|
| 0 | 3313 |
| 1 | 3313 |

Please note that we’ve set `random_state` as the same integer as the `pandas` example, so we generate the same balanced training set. But as you can see, `imblearn` needs less code to do it.

Then, we can apply logistic regression the same way and calculate the AUC metric. It gives the same AUC of 0.838962605548854 as the `pandas` method, since we used the same `random_state` again.

### Oversampling with shrinkage

Now that we are done with basic random sampling, let’s add some noise to the sample. We can do this with the **shrinkage** parameter in `imblearn`.

The below code is similar to the previous random sampling example, except for the extra `shrinkage=0.1` argument.

`shrinkage` can take values greater than or equal to 0. When `shrinkage=0`, it is the same as simple random sampling. The larger the `shrinkage` value, the more noise we add, so the more dispersed the new samples will be. This is useful when we don’t always want to repeat the samples.

By setting it to 0.1 below, we make the sampled observations a bit different from the original. You can read more about the `shrinkage` factor here.

This results in a different sample, but still, a balanced one with both classes having the same number of observations.

| Class | Count |
|---|---|
| 0 | 3313 |
| 1 | 3313 |

We can again apply the logistic regression algorithm and check its AUC.

This gives an AUC of 0.8059911540008041. Changing the `shrinkage` value might improve the result. Please feel free to tune it as you need.

### Oversampling using SMOTE

The last oversampling technique we’ll cover is **SMOTE** (Synthetic Minority Over-sampling TEchnique). It is a more sophisticated technique than the previous ones. Random sampling is easy, but the new samples *don’t add more information* to the machine learning algorithms. SMOTE improves on that.

SMOTE oversamples the minority class by creating ‘synthetic’ examples rather than copies. It picks a minority example, finds its nearest neighbors within the minority class, and creates a new point by interpolating between the example and one of those neighbors. You can read more about it in its original paper.

We can apply SMOTE oversampling through the `imblearn` library. The process is similar to random oversampling with replacement, but we use the `SMOTE` class to resample.

Great! The SMOTE oversampling also generates a balanced dataset.

| Class | Count |
|---|---|
| 0 | 3313 |
| 1 | 3313 |

As before, we’ll apply logistic regression on the balanced dataset and calculate its AUC.

The AUC is 0.7913148371531966.

That’s all for the oversampling techniques, and we’ll move on to undersampling.

## 3. Undersampling

As the name suggests, **undersampling** downsizes the majority class to balance it with the minority class.

There are also many methods of undersampling. We’ll cover the below popular ones:

- **Simple random undersampling**: the basic approach of random sampling from the majority class.
- **Undersampling using K-Means**: synthesize based on the cluster centroids.
- **Undersampling using Tomek links**: detects and removes samples from Tomek links.

Let’s apply each of these undersampling techniques to our example dataset.

### Simple random undersampling

We’ll begin with simple **random undersampling**. We take a sample from the majority class, to have the same size as the minority class. So there are risks of removing useful information from the dataset.

We’ll use two ways to achieve this in Python. One uses the `pandas` library, which is more transparent so that you can understand the process. The other uses the `imbalanced-learn` library, which is less code to implement. Both can produce the same results. So you can choose either of them.

#### pandas

I won’t go through the process again since it’s similar to oversampling with `pandas`. But please note that the sampling process is *without* replacement in this example. This results in a new balanced training set: both classes have 26 observations.
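For reference, a sketch of the `pandas` version on synthetic stand-in data could look like this. The name `df_train_undersample`, the made-up features, and the `random_state` value are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train (assumption): 3313 negatives and 26 positives.
rng = np.random.default_rng(10)
df_train = pd.DataFrame(rng.normal(size=(3339, 2)), columns=["f1", "f2"])
df_train["Class"] = [0] * 3313 + [1] * 26

minority = df_train[df_train["Class"] == 1]
majority = df_train[df_train["Class"] == 0]

# Keep as many majority rows as there are minority rows, without replacement.
majority_sample = majority.sample(n=len(minority), replace=False, random_state=10)

df_train_undersample = pd.concat([majority_sample, minority], ignore_index=True)

print(df_train_undersample["Class"].value_counts())
```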

| Class | Count |
|---|---|
| 0 | 26 |
| 1 | 26 |

We also apply the logistic regression and calculate AUC.

The AUC is 0.6465621230398071. This is much lower than the oversampling techniques. This is because the minority class has a small number of samples. We removed a lot of information when undersampling.

#### imbalanced-learn

The process with `imblearn` is also similar, but we use the `RandomUnderSampler` class instead.

| Class | Count |
|---|---|
| 0 | 26 |
| 1 | 26 |

And this produces the same AUC as `pandas` undersampling, since we use the same `random_state`.

### Undersampling using K-Means

Besides random sampling, we could also use the cluster centroid of the K-Means method as the new sample of the majority class. This means the new sample of the majority class is not the original data anymore. They are synthesized with cluster centroids. So the new sample should be more representative of the actual majority class data. Please read more about it here.

We could, again, use the `imblearn` library. There’s a `ClusterCentroids` class. The below code undersamples the majority class to be the same size as the minority class, using the cluster centroids.

| Class | Count |
|---|---|
| 0 | 26 |
| 1 | 26 |

After applying the logistic regression on the new balanced dataset, we get an AUC of 0.6377161238439888. Again, the value is low due to the loss of information.

### Undersampling using Tomek links

Lastly, let’s look at the Tomek links undersampling approach. This method detects Tomek links and removes samples based on them.

What is a Tomek link?

It is a pair of samples from different classes. When the two samples are each other’s nearest neighbors, they form a Tomek link.

In our binary classification example, a Tomek link is a pair of examples, one from each class, that are each other’s closest neighbors in the dataset. After detecting such a link, we could remove data within the pair. Usually, we remove the sample from the majority class to achieve undersampling, i.e., remove the majority class samples that are close to the minority class. This removes ambiguity between the two classes.

So, undersampling with Tomek links cleans up the overlaps between classes, making them easier to distinguish.

We could use `imblearn`’s `TomekLinks` class to do this. Note that there’s no randomness with Tomek links, so we don’t have the `random_state` parameter.

By default, Tomek links undersampling only removes the majority class samples that are close to the minority class. So the data remains unbalanced. In reality, we could combine the Tomek links approach with other techniques. We’ll give an example in the next section.

| Class | Count |
|---|---|
| 0 | 3298 |
| 1 | 26 |

With the Tomek links undersampling method by itself, the AUC of the logistic regression is 0.683956574185766.

That’s it for undersampling.

Next, we’ll look at the hybrid approach.

## 4. Combining Oversampling and Undersampling

Each of the oversampling and undersampling techniques has its pros and cons. Sometimes it is good to mix and combine their strengths. This section will look at an example of **oversampling using SMOTE and undersampling using Tomek links**.

### SMOTE and Tomek links

The SMOTE oversampling approach could generate noisy samples since it creates synthetic data. To solve this problem, after SMOTE, we could use undersampling techniques to clean up. We’ll use the Tomek links undersampling technique in this example.

Within the `imblearn` library, there’s the `SMOTETomek` class that can help.

This results in a balanced dataset with each class of size 3309.

| Class | Count |
|---|---|
| 0 | 3309 |
| 1 | 3309 |

Let’s go through the details.

As you may recall, the original training set has class 0 of size 3313 and class 1 of size 26. The `SMOTETomek` approach first oversampled with SMOTE, which results in a sample with both classes of size 3313. Then the Tomek links technique kicked in and cleaned up the ‘links’, resulting in fewer samples in both classes.

In this `SMOTETomek` technique, both samples of each Tomek link, one from the majority class and one from the minority class, are removed. Therefore, this results in a dataset with two classes of the same size.

The AUC of the logistic regression with the `SMOTETomek` technique is 0.7913148371531966.

There are also other ways to combine the techniques. You can be creative!

## 5. Weighing classes differently

Besides resampling data, we can also balance the classes by weighing the data differently. As you know, we usually consider each observation equally, with a weight value of 1. But for imbalanced datasets, we can balance the classes by putting more weight on the minority classes.

For example, suppose we want the overall weights of the minority and majority classes to be equal. In that case, we can use the `compute_class_weight` function from `scikit-learn`. The below code estimates `weights` for our imbalanced training dataset.
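A minimal sketch of this calculation, assuming a target array `y_train` with the class counts of our training set (3313 negatives and 26 positives):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Target with the training set's class counts (the array itself is a stand-in).
y_train = np.array([0] * 3313 + [1] * 26)

# 'balanced' gives each class a weight inversely proportional to its frequency.
weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_train), y=y_train
)

print(weights)
```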

The variable `weights` is assigned an array as below.

array([ 0.50392394, 64.21153846])

This means that if we want the dataset to be balanced, we need to weigh the majority class at 0.50392394 and the minority class at 64.21153846. So a much higher weight for the minority class.

Let’s verify that these weights can indeed balance the dataset.

Using the below code, we multiply the counts of each class by their respective weights.
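Here is a sketch of that check. Instead of calling `scikit-learn`, it recomputes the weights with the formula behind the ‘balanced’ option, `n_samples / (n_classes * class_counts)`, so you can see where the numbers come from. The variable names are assumptions.

```python
import numpy as np

# Class counts of the training set: 3313 negatives and 26 positives.
counts = np.array([3313, 26])
n_samples, n_classes = counts.sum(), len(counts)

# The 'balanced' heuristic: n_samples / (n_classes * class_counts).
weights = n_samples / (n_classes * counts)

# Weighted size of each class, and their total.
weighted_counts = counts * weights

print(weighted_counts)
print(weighted_counts.sum(), n_samples)
```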

Both of them give the same number of 1669.5. So by applying these weights, the majority and minority classes would be equally weighted.

1669.5 1669.5000000000002

If we sum up the weighted counts of both classes, the total equals the number of observations, as if we weighed each observation by 1.

3339.0 3339

All right! So now you’ve got the idea of how to weigh classes differently. What does this mean for a machine learning algorithm like logistic regression?

The different weights make it cost more to misclassify a minority class example than a majority class example. This supports our goal of classifying the minority class.

In the `LogisticRegression` class within `sklearn`, we can apply different weights to balance the data with the parameter `class_weight`. So we don’t need to go through the calculation above.

We can use the code below to apply logistic regression to the differently weighted dataset, with the extra argument `class_weight='balanced'`. The rest of the process is the same.
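A sketch with `class_weight='balanced'` follows. It uses synthetic data (two made-up features, positives shifted by 2) with the class counts of our training and test sets, so the AUC it prints will not match the tutorial’s value. The helper `make_xy` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)

def make_xy(n_neg, n_pos):
    """Hypothetical helper: synthetic features with positives shifted away from negatives."""
    X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 2)), rng.normal(2.0, 1.0, (n_pos, 2))])
    y = np.array([0] * n_neg + [1] * n_pos)
    return X, y

X_train, y_train = make_xy(3313, 26)  # imbalanced training set, no resampling
X_test, y_test = make_xy(829, 6)

# The weights are derived from the class frequencies in y_train during fitting.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)

y_pred = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print(auc)
```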

The AUC of this technique is 0.8275030156815439.

Besides changing the weights of the two classes to balance them, we can also specify custom weights for the positive and negative classes. For example, the below code weighs class 1 100 times more than class 0.
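Custom weights are passed as a dictionary that maps each class label to its weight. The sketch below again uses synthetic stand-in data and a hypothetical helper `make_xy`, so its AUC is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)

def make_xy(n_neg, n_pos):
    """Hypothetical helper: synthetic features with positives shifted away from negatives."""
    X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 2)), rng.normal(2.0, 1.0, (n_pos, 2))])
    y = np.array([0] * n_neg + [1] * n_pos)
    return X, y

X_train, y_train = make_xy(3313, 26)
X_test, y_test = make_xy(829, 6)

# Errors on class 1 cost 100 times more than errors on class 0.
clf = LogisticRegression(class_weight={0: 1, 1: 100})
clf.fit(X_train, y_train)

y_pred = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print(auc)
```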

This returns an AUC of 0.8375552874949739.

## 6. Changing algorithms

We’ve been using the logistic regression algorithm so far. There are other machine learning algorithms that are more tolerant of imbalanced data, for example, decision tree-based models.

**Further learning:**

Decision Tree Model in Machine Learning: Practical Tutorial with Python

Unlocking Random Forest in Machine Learning

We won’t show examples of those algorithms. If you are interested, please take a look at this paper. This survey reviewed and compared different imbalanced datasets, machine learning algorithms, and balancing techniques.

The authors found that simple linear algorithms like logistic regression benefitted more from the balancing techniques, while for more complicated models such as random forest and XGBoost, the results were mixed. So, in conclusion, they recommended balancing data for linear models.

For example, for our abalone dataset, the logistic regression applied to the original dataset gives an AUC of 0.683956574185766. This is low compared to many of the balancing techniques we’ve tried. So the balancing techniques did help!

## Which technique to choose?

You’ve learned popular techniques for handling imbalanced data sets in machine learning. Which one is the best?

This is a tricky question.

There’s no rule of thumb. So you need to try different techniques and compare their performance with the original dataset. But here are a couple of tips:

- When the minority class is too small, like our example dataset, undersampling alone is not good.
- When the overall dataset is too large, it’s better to undersample, and then perhaps oversample.

Besides the techniques mentioned in this tutorial, there are also other ones. But these should be a good starting point for you to explore.

Try them out now!

In this post, you’ve learned about popular techniques to handle imbalanced data in machine learning classification.

Hope you now know where to start dealing with your imbalanced data.

We’d love to hear from you. Leave a comment for any questions you may have or anything else.