I have a dataset with a very skewed class distribution (approximately 90% class 0 and 10% class 1). I have considered using undersampling to reduce the size of the majority class. I would like to know the right order in which to undersample the majority class and run cross-validation.
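Below is a minimal sketch of the order usually recommended: do not undersample the whole dataset up front; instead, put the undersampler inside a pipeline so it is re-fit on each training fold during cross-validation, while every held-out fold keeps the original class distribution. It assumes scikit-learn and imbalanced-learn are installed; the synthetic 90/10 data, the logistic regression model, and the ROC AUC metric are illustrative choices, not anything prescribed in this thread.

```python
# Sketch: undersample INSIDE each CV fold, not before cross-validation.
# Assumes scikit-learn and imbalanced-learn; the 90/10 synthetic data
# below just mimics the imbalance described in the question.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline              # imblearn's Pipeline accepts samplers
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Putting the undersampler in the pipeline means it is re-fit on each
# training fold only; the held-out fold keeps its original distribution.
model = Pipeline([
    ("undersample", RandomUnderSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```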
- Why throw away data? – Dave Mar 16 '21 at 02:04
- Well, the alternative would be oversampling. I have an imbalance problem and I would like to avoid overfitting, so I was considering resampling techniques together with either a train/test split or cross-validation (k-fold or stratified k-fold). – LdM Mar 16 '21 at 02:37
- Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Mar 16 '21 at 09:30
- Thanks Dave for all the references! One doubt I still have: would it be correct to split the data into training and test sets and then use k-fold cross-validation? Other material I read mentioned stratified k-fold cross-validation. What do you think would be a good approach? Is it wrong to do a first train/test split and then use plain k-fold instead of stratified k-fold? Undersampling would be applied only to the training set (see the sketch after these comments). – LdM Mar 16 '21 at 09:48
- Please read the linked material, particularly the first link, to see why class imbalance is not a problem. – Dave Mar 16 '21 at 09:58
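For the workflow LdM describes in the last comment (a stratified train/test split first, then stratified k-fold cross-validation on the training portion, with undersampling applied only to the data the model is fit on), here is a minimal sketch under the same assumptions as the block above; the data, model, and metric are again illustrative.

```python
# Sketch of: stratified hold-out split -> stratified k-fold CV on the
# training set -> undersampling applied only inside the training folds.
# Assumes scikit-learn and imbalanced-learn; numbers are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratified hold-out split so both sets keep the 90/10 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = Pipeline([
    ("undersample", RandomUnderSampler(random_state=0)),   # fit only on training folds
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified k-fold on the training set only; the sampler never touches
# the validation folds or the final test set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean())

# Final evaluation on the untouched test set.
model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```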