1

I have a dataset that is highly imbalanced. I did some research on the Internet, but I did not find what I was looking for. What is the correct sequence of steps for dealing with imbalanced data?

Should we balance the dataset before cleaning the data or not?

Thanks in advance.

  • 2
    Class imbalance almost certainly is not a problem, and there is no need to use undersampling, oversampling, or artificial balancing to solve a non-problem. It might be helpful if you said why you find the imbalance problematic. Statisticians do not see such a problem. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Dec 10 '21 at 06:14
  • I tried both oversampling and other techniques on the dataset. As you said, I get better precision and recall scores from using class weights instead of oversampling. With oversampling, the algorithm that I built led to overfitting. – Bengu Atici Dec 13 '21 at 08:52

2 Answers

0

I had the same issue a few weeks ago. What you want to do is:

  1. First clean your dataframe: drop_duplicates, etc.
  2. Resample the class that has more samples. If class A has 85% of y and class B the remaining 15%, you can resample class A; by doing this you will drop samples from class A, but you will get a better ratio between A and B.
from sklearn.utils import resample

# Size of the smallest class
min_value = df.target.value_counts().min()

# Split the classes into two populations, so you can resample the one you want
pop1 = df[df.target == 0].sample(min_value)
pop2 = df[df.target == 1]

# With this I downsampled pop2 from over 4000 samples to 1500
pop2 = resample(pop2, replace=False, n_samples=1500)

Hopefully this will help you.
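As the comments on this question point out, class weights are often a better option than dropping data. A minimal sketch (the function name and example counts are illustrative, not from any library) of the "balanced" weighting heuristic that frameworks such as scikit-learn use:

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * n_class_samples),
    so rarer classes contribute proportionally more to the loss."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Example: the 85% / 15% split from the answer above
y = [0] * 85 + [1] * 15
weights = balanced_class_weights(y)
# The minority class gets a weight roughly 5.7x that of the majority class
```

No observations are discarded; the model simply pays more attention to mistakes on the rare class.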

  • Welcome to Cross Validated! You might be interested in the links I posted. Proper statistical methods handle imbalance just fine, and dropping observations to artificially balance the ratio deprives the model of valuable training data, sacrificed to solve a non-problem. – Dave Dec 10 '21 at 13:16
0

I think how to handle imbalanced data depends on which part of the data you will focus on in the production environment.

  1. For example, a spam email recognition task: if spam email reaching users is the less tolerable error, that means we need more data with the "spam" label; or vice versa. (Think about the definitions of precision and recall.)
  2. Another example is image recognition: we want a DL model to predict whether an image is a cat or a dog. The importance and error tolerance are equal between the two labels, which means balanced data is what we need.
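To make the precision/recall trade-off in the spam example concrete, here is a generic sketch of both metrics computed from confusion-matrix counts (the counts are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision: of everything flagged as spam, how much really was spam.
    Recall: of all real spam, how much we actually caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A filter that flags 40 emails: 30 real spam (TP), 10 legitimate (FP),
# while 20 spam emails slip through (FN)
p, r = precision_recall(tp=30, fp=10, fn=20)
# p = 0.75 (few false alarms), r = 0.6 (many misses)
```

Which of the two numbers you optimize for is exactly the "which error is less tolerable" question above.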

The process to deal with the data:

  1. Find more real-world data (not computer-generated data). This is the best way to solve the problem.
  2. If enough data has been collected, under-sampling is the better option.
  3. Otherwise, try over-sampling or the SMOTE method.

Try the imbalanced-learn module; it will help you a lot.
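For reference, the core idea behind SMOTE is interpolating between a minority sample and a nearby minority-class neighbour. Below is a minimal NumPy sketch of just that idea (the function is a toy, not the real algorithm; the production implementation, which handles k neighbours, categorical features, etc., is imbalanced-learn's SMOTE class):

```python
import numpy as np

def smote_like_sample(X_minority, n_new, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and its nearest minority-class neighbour."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # nearest neighbour of X[i] within the minority class (excluding itself)
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(synthetic)

# Three minority samples in 2D; generate five synthetic ones
X_min = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
X_new = smote_like_sample(X_min, n_new=5, rng=0)
# Every synthetic point lies on a segment between two minority samples
```

Because the synthetic points sit between existing minority samples rather than duplicating them, SMOTE tends to overfit less than naive oversampling with replacement.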

  • 1
    You might be interested in the links I posted, one of which explicitly deals with spam email detection. Statisticians do not see class imbalance as much of a problem. – Dave Dec 10 '21 at 13:13