I have a dataset that I'd like to use for classification. Some columns contain NULL values that I need to impute, and I plan to use either the median or the mean. What I want to know is which order is correct: should I impute with the median/mean before splitting into train and test sets, or should I split first, compute the median/mean on the training set only, and then use that training-set value to impute both the training set and the test set?

HHH

1 Answer

In a classification context, it's fine to impute values of the independent variables for all cases before the train–test split (so long as your imputation scheme ignores the dependent variable, as mean or median imputation would). The train–test split is only supposed to hide values of the dependent variable, not the independent variables.
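For concreteness, here is a minimal sketch of the two workflows the question describes. The toy data and the helper names `column_median` and `impute` are made up for illustration, and only the Python standard library is used, so treat it as a sketch under those assumptions rather than a recommended implementation.

```python
import random
from statistics import median


def column_median(rows, col):
    """Median of the non-missing values in one column."""
    return median(r[col] for r in rows if r[col] is not None)


def impute(rows, col, value):
    """Return copies of the rows with missing entries in `col` set to `value`."""
    return [{**r, col: value if r[col] is None else r[col]} for r in rows]


# Toy data (hypothetical): one feature "x" with gaps, one class label "y".
data = [
    {"x": x, "y": y}
    for x, y in [(1.0, 0), (None, 0), (3.0, 1), (4.0, 1), (None, 1), (10.0, 0)]
]

# Option A: impute on the full data set, then split.
imputed_all = impute(data, "x", column_median(data, "x"))
shuffled = imputed_all[:]
random.Random(0).shuffle(shuffled)
train_a, test_a = shuffled[:4], shuffled[4:]

# Option B: split first, then impute both parts with the training-set median.
shuffled = data[:]
random.Random(0).shuffle(shuffled)  # same seed, so the same split as option A
train_raw, test_raw = shuffled[:4], shuffled[4:]
train_median = column_median(train_raw, "x")
train_b = impute(train_raw, "x", train_median)
test_b = impute(test_raw, "x", train_median)
```

Neither helper ever reads `y`, which is the condition stated above. With scikit-learn, option B corresponds to fitting `SimpleImputer(strategy="median")` on the training features and calling `transform` on the test features.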

Kodiologist