
Suppose that in my dataset of 100 observations, only 25 have a target variable equal to 1, while the other 75 have a target variable equal to 0. Should the proportion of positive target values affect the size I choose for the train/test split? In other words, should the proportion $p$ of the 100 observations assigned to the training set be a function of the proportion $p'$ of the 100 observations that have positive target values?

Edit: the numbers 100, 25, and 75 are chosen for simplicity, but I am more interested in the general case.

DavidSilverberg
No. That would simply be over-/undersampling, [which is not useful, even for unbalanced datasets (which are usually not a problem)](https://stats.stackexchange.com/q/357466/1352). – Stephan Kolassa Mar 08 '21 at 12:55
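
To make the setup in the question concrete, here is a minimal sketch, not taken from the thread, of a stratified split in plain Python. The helper names `stratified_split` and `positive_rate` are hypothetical, and stratification is an assumption brought in for illustration: it sends the same fraction $p$ of each class to the training set, so the share of positives in both parts stays at $p'$ whatever $p$ is chosen.

```python
import random


def stratified_split(y, p, seed=0):
    """Return (train_idx, test_idx), sending a fraction p of each class to train.

    Hypothetical helper for illustration only; not from the original thread.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(y)):
        # Indices of all observations with this target value.
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        cut = round(p * len(idx))  # number of this class assigned to train
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def positive_rate(y, idx):
    """Share of positive targets among the observations in idx."""
    return sum(y[i] for i in idx) / len(idx)


# The numbers from the question: 25 positives, 75 negatives.
y = [1] * 25 + [0] * 75

for p in (0.6, 0.8):
    train_idx, test_idx = stratified_split(y, p)
    print(p, len(train_idx), positive_rate(y, train_idx), positive_rate(y, test_idx))
```

With the question's numbers, both choices of $p$ leave 25% positives in the training set and in the test set, so in this sketch $p$ is chosen without reference to $p'$.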
