1

I have a dataset that is highly imbalanced. I did some research on the Internet, but I did not find what I was looking for. What is the correct sequence of steps for dealing with imbalanced data?

Should we balance the dataset before cleaning the data or not?

Thanks in advance.

  • 2
    Class imbalance almost certainly is not a problem, and there is no need to use undersampling, oversampling, or artificial balancing to solve a non-problem. It might be helpful if you said why you find the imbalance problematic. Statisticians do not see such a problem. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Dec 10 '21 at 06:14
  • I tried both oversampling and other techniques on the dataset. As you said, I get better precision and recall scores from using class weights instead of oversampling. With oversampling, the algorithm that I built led to overfitting. – Bengu Atici Dec 13 '21 at 08:52

2 Answers

0

I had the same issue a few weeks ago. What you want to do is:

  1. First clean your dataframe: drop_duplicates, etc.
  2. Resample the class that has more samples. If class A has 85% of y and class B the remaining 15%, you can resample class A; by doing this you will drop samples from class A, but you will get a better ratio between A and B.
from sklearn.utils import resample

# Size of the smallest class
min_value = df.target.value_counts().min()

# Split the classes into two populations, so you can resample the one you want
pop1 = df[df.target == 0].sample(min_value)
pop2 = df[df.target == 1]

# With this I downsampled pop2 from over 4000 samples to 1500
pop2 = resample(pop2, replace=False, n_samples=1500)

Hopefully this will help you.
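As the comments on this question point out, class weights are often a better option than dropping data. A minimal sketch (the function name and example counts are illustrative, not from any library) of the "balanced" weighting heuristic that frameworks such as scikit-learn use:

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * n_class_samples),
    so rarer classes contribute proportionally more to the loss."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Example: the 85% / 15% split from the answer above
y = [0] * 85 + [1] * 15
weights = balanced_class_weights(y)
# The minority class gets a weight roughly 5.7x that of the majority class
```

No observations are discarded; the model simply pays more attention to mistakes on the rare class.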

  • Welcome to Cross Validated! You might be interested in the links I posted. Proper statistical methods handle imbalance just fine, and dropping observations to artificially balance the ratio deprives the model of valuable training data, sacrificed to solve a non-problem. – Dave Dec 10 '21 at 13:16
0

I think how to handle imbalanced data depends on which part of the data you will focus on in the production environment.

  1. For example, a spam email recognition task: if spam email reaching users is the less tolerable error, that means we need more data with the "spam" label; or vice versa. (Think about the definitions of precision and recall.)
  2. Another example is image recognition: we want a DL model to predict whether an image is a cat or a dog. The importance and error tolerance are equal between the two labels, which means balanced data is what we need.
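To make the precision/recall trade-off in the spam example concrete, here is a generic sketch of both metrics computed from confusion-matrix counts (the counts are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision: of everything flagged as spam, how much really was spam.
    Recall: of all real spam, how much we actually caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A filter that flags 40 emails: 30 real spam (TP), 10 legitimate (FP),
# while 20 spam emails slip through (FN)
p, r = precision_recall(tp=30, fp=10, fn=20)
# p = 0.75 (few false alarms), r = 0.6 (many misses)
```

Which of the two numbers you optimize for is exactly the "which error is less tolerable" question above.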

The process to deal with the data:

  1. Find more real-world data (not computer-generated data). This is the best way to solve the problem.
  2. If enough data has been collected, under-sampling is the better option.
  3. Otherwise, try over-sampling or the SMOTE method.

Try the imbalanced-learn module; it will help you a lot.
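For reference, the core idea behind SMOTE is interpolating between a minority sample and a nearby minority-class neighbour. Below is a minimal NumPy sketch of just that idea (the function is a toy, not the real algorithm; the production implementation, which handles k neighbours, categorical features, etc., is imbalanced-learn's SMOTE class):

```python
import numpy as np

def smote_like_sample(X_minority, n_new, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and its nearest minority-class neighbour."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # nearest neighbour of X[i] within the minority class (excluding itself)
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(synthetic)

# Three minority samples in 2D; generate five synthetic ones
X_min = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
X_new = smote_like_sample(X_min, n_new=5, rng=0)
# Every synthetic point lies on a segment between two minority samples
```

Because the synthetic points sit between existing minority samples rather than duplicating them, SMOTE tends to overfit less than naive oversampling with replacement.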

  • 1
    You might be interested in the links I posted, one of which explicitly deals with spam email detection. Statisticians do not see class imbalance as much of a problem. – Dave Dec 10 '21 at 13:13