Random sampling methods for handling class imbalance

Question

https://www.svds.com/learning-imbalanced-classes/ explains quite nicely the different ways to handle an imbalanced dataset. But there is an information under the random undesampling and random oversampling technique which I am not sure is correct or not, as I could not find the same information in different research articles (paper1,paper2). The information for which I will highly appreciate clarifications are:

1) Does random oversampling the minority class increase the size of the final data set such that each class is the same size as that of the majority?

2) Does random undersampling decrease the total size of the dataset such that each class is the same size as that of the minority?

For example, if the minority class has 20 examples and majority class has 80 examples, then would the result of random oversampling be: (20+80) + 80 = 180

and for the random undersampling technique: 20 + (80-60) = 20+20 =40?

3) Are these methods random sampling with or without replacement?

score 1 · Accepted Answer · answered Jun 17 '18 at 09:15

1

1) Does random oversampling the minority class increase the size of the final data set such that each class is the same size as that of the majority?

It does, however it doesn't have to. Depending on the implementation one could of course also oversample to a size that is bigger than the original majority class if the misclassification costs associated with the problem warrent that.

Same goes for undersampling - the general idea is to balance the data set, so getting them to matching size makes sense unless we have a reason to tip the class balance the other way.

3) Are these methods random sampling with or without replacement?

In the case of the first graphic you posted and assuming our objective is to even out the size of the classes, we will have to sample with replacement unless we synthezise new instances from the known ones, like SMOTE does.

In the case of undersampling both with and without replacement is possible, although I'm only aware of it being used without replacement then.

answered Jun 17 '18 at 09:15

deemel

2,402
4
20
37

Thank you for your answer. I would appreciate for the following clarifications: (1) The picture is for sampling techniques without replacement? Random Under and over sampling without replacement would change the size of the data set ? With replacement would not change the size of the data set?Is my understanding correct? (3) When to use with and without replacement methods?Is there a rule of thumb? (4)Can I use both undersampling and oversampling together? – Srishti M Jun 19 '18 at 15:28
ad 1) Consider this: if you were to sample from the red class in the top half of the graphic without replacement, what would be the maximum size of the sample? ad 3) Both have their culprits, one of the factors to consider is the number of instances. I'm not aware of any general decision rule. ad 4) So you mean undersampling the majority and oversampling the minority, meeting somewhere in the middle? Technically yes, but I've never heard of that being used. – deemel Jun 19 '18 at 15:59
Thank you for your reply. For point 1) It is not clear to me what happens if sampling without replacement of the minority (red class) (Oversampling method) and with replacement. I guess if I want to do without replacement then the class size may not be 50:50 or it will be equal to a user specified ratio. For sampling without replacement then we apply the SMOTE algorithm. This is still confusing to me. Can you please help – Srishti M Jun 19 '18 at 17:47
Sampling with and without replacement works the same here as it does in general probability theory. Imagine randomly grabbing pieces of paper with a number on them out of a hat. Without replacement, i.e. without putting the pieces back in the hat after drawing them, you can only draw as many samples as there are in the hat. Same with the minority class here, so you _have_ to sample with replacement if you want to come anywhere close to the same amount of instances – deemel Jun 19 '18 at 17:51
So for the number of samples of the minority class to be equal to the majority class, I have to do sample with replacement? – Srishti M Jun 19 '18 at 17:55
Yes, that is the case. – deemel Jun 19 '18 at 17:58
1

Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jul 26 '18 at 06:42
Stephan I agree with your position on this topic, as you might know. The question was directed at how this kind of resampling is performed. While the title of the question suggests it, I did not claim in my answer that it helps against class imbalance or even is the recommended approach to it. Maybe an edit to the questions title would put this in the right context. – deemel Jul 27 '18 at 11:34

Random sampling methods for handling class imbalance

1 Answers1