I need some clarification on undersampling datasets. I have three datasets: an undersampled training set, an undersampled validation set, and a test set that is not undersampled and reflects the true population distribution. My questions are:

  1. I am using early stopping, so the model trains until performance on the validation dataset stops improving. Should the validation dataset be undersampled, or should it represent the true population like the test dataset? I ask because my validation dataset is currently undersampled and therefore not a true representation of the population. Will the model overfit?
  2. If it is acceptable to use an undersampled validation dataset, should the undersampling ratio of the target variable be the same in the validation and training datasets? What are the disadvantages of using different undersampling ratios?
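
To make the setup concrete, here is a minimal sketch of the three splits described above. Everything in it is an assumption for illustration, not my actual pipeline: the data is synthetic with a 5% positive rate, the helper `undersample` is hypothetical, and the 1:1 negative-to-positive ratio is just an example value. The validation ratio is a separate argument so that the two ratios in question 2 can be set independently.

```python
import numpy as np

def undersample(X, y, ratio, rng):
    """Hypothetical helper: keep every positive and randomly keep
    about `ratio` negatives per positive (capped at the available negatives)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(round(ratio * len(pos))))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]

rng = np.random.default_rng(0)

# Synthetic imbalanced data (assumption): roughly 5% positives.
n = 10_000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 3)) + y[:, None]

# Split first, then undersample, so the test set keeps the population distribution.
idx = rng.permutation(n)
train_idx, val_idx, test_idx = idx[:6000], idx[6000:8000], idx[8000:]

X_train, y_train = undersample(X[train_idx], y[train_idx], ratio=1.0, rng=rng)
X_val, y_val = undersample(X[val_idx], y[val_idx], ratio=1.0, rng=rng)  # ratio could differ
X_test, y_test = X[test_idx], y[test_idx]  # left untouched

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```

The printed positive rates show the mismatch I am asking about: the training and validation sets are balanced, while the test set keeps the original class proportions.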
Shayan Shafiq
RH1
    Statisticians do not see class imbalance as an inherent problem, and there is no need to use undersampling to solve a non-problem. It might be helpful if you said why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Dec 24 '21 at 00:49

0 Answers