I have a dataset with a very skewed class distribution (approx. 90 with class 0 and 10 with class 1). I have considered using undersampling to reduce the size of the majority class. I would like to know the right order in which to undersample the class and run cross-validation.

LdM
  • Why throw away data? – Dave Mar 16 '21 at 02:04
  • Well, the alternative would be oversampling. I have an imbalance problem and I would like to avoid overfitting, so I was considering resampling techniques together with either a train/test split or cross-validation (k-fold or stratified). – LdM Mar 16 '21 at 02:37
  • Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Mar 16 '21 at 09:30
  • Thanks, Dave, for all the references! One doubt I have is whether it would be correct to split the data into training and test sets and then use k-fold cross-validation. Other material I read mentioned stratified k-fold cross-validation. What do you think would be a good approach? Is it wrong to do a first train/test split and then use plain k-fold instead of stratified k-fold? Undersampling would be applied only to the training set. – LdM Mar 16 '21 at 09:48
  • Please read the linked material, particularly the first link, to see why class imbalance is not a problem. – Dave Mar 16 '21 at 09:58
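
For reference, the order described in the comments (split first, then undersample only the training part) could look roughly like the sketch below. It is only an illustration under assumptions, not a recommendation over the linked material: the toy labels (90 of class 0, 10 of class 1), the helper names `stratified_folds`, `undersample` and `cv_splits_with_undersampling`, and the choice of 5 folds are all made up for the example, and only the Python standard library is used. The point it shows is that each validation fold keeps the original class ratio, while resampling touches the training indices of that fold only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Assign every index to one of k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds


def undersample(indices, labels, rng):
    """Randomly drop indices until every class has as many as the rarest one."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    n_min = min(len(v) for v in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    return sorted(kept)


def cv_splits_with_undersampling(labels, k=5, seed=0):
    """Yield (train_idx, val_idx); only the training indices are undersampled."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for j in range(k):
        val_idx = sorted(folds[j])  # validation fold is left untouched
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        yield undersample(train_idx, labels, rng), val_idx


# Hypothetical toy labels mirroring the 90/10 imbalance from the question.
labels = [0] * 90 + [1] * 10
splits = list(cv_splits_with_undersampling(labels, k=5, seed=0))
for train_idx, val_idx in splits:
    n_pos_train = sum(labels[i] for i in train_idx)
    n_pos_val = sum(labels[i] for i in val_idx)
    # Each line prints: 16 8 20 2
    print(len(train_idx), n_pos_train, len(val_idx), n_pos_val)
```

With these toy labels every training set ends up balanced (8 of each class) while every validation fold still has 18 of class 0 and 2 of class 1, so the evaluation reflects the original distribution. A held-out test set, if used, would be split off before any of this and never resampled.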

0 Answers