
In a typical ML workflow, we import the data (X and y), split X and y into train, validation and test sets, preprocess each split (scaling, encoding, imputing NaN values, etc.), perform hyperparameter (HP) tuning, and, once we have the best model with the best HPs, fit the final model to the whole dataset (i.e. X and y).

The issue is that X and y themselves are never preprocessed; only the train, validation and test splits are. So fitting the final model on X and y raises an error, because X and y have not been encoded (or put through the other preprocessing steps). How are we then supposed to train the final model on the whole dataset? Do we preprocess X and y before fitting the final model? And if so, won't that lead to data leakage or overfitting?

Any help will be much appreciated!

spectre
  • Preprocessing should be part of model development and testing. You can achieve this easily via sklearn pipelines. – gunes Nov 29 '21 at 11:32
  • The reason I don't use pipelines is that they perform the same preprocessing for all the features. For example, I want `StandardScaler` for 2 out of 5 features and `RobustScaler` for the other 3. With a pipeline I cannot do that: if I put `StandardScaler` in the pipeline, all 5 features will be scaled using `StandardScaler`. I have different preprocessing techniques for different features, and pipelines don't allow that. Also, the duplicate question is totally different from my question. So if you have closed this question, kindly reopen it. – spectre Nov 30 '21 at 05:55
  • I haven’t closed it. You can nominate it for reopening. For the scalers, you can override the classes to perform your custom scaling. – gunes Nov 30 '21 at 07:35
  • I used scalers just as an example. It could be any preprocessing step, like scaling, encoding, transformation, imputation, etc. Does sklearn have any implementation of such a pipeline? – spectre Nov 30 '21 at 07:37
  • As long as you inherit from `BaseEstimator` and override the `fit` and `transform` methods, you can insert anything into a pipeline. – gunes Nov 30 '21 at 08:55
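
The suggestions in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions, not something posted in the thread: the column names `a` to `e`, the synthetic data, the `ClipOutliers` class and the choice of `LogisticRegression` are all hypothetical. It also uses sklearn's `ColumnTransformer`, which nobody in the comments mentions, as one possible way to apply `StandardScaler` to two features and `RobustScaler` to the other three inside a single pipeline. Because every preprocessing step lives in the pipeline, calling `fit` on the whole of X and y re-learns the preprocessing from that data, so X and y never need to be preprocessed by hand.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical custom step: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn the clipping bounds only from the data passed to fit.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


# Synthetic stand-in for the real dataset (assumption: 5 numeric features).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = (X["a"] + X["c"] > 0).astype(int)

# Different preprocessing per group of columns.
preprocess = ColumnTransformer([
    ("standard", StandardScaler(), ["a", "b"]),
    ("robust",
     Pipeline([("clip", ClipOutliers()), ("scale", RobustScaler())]),
     ["c", "d", "e"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# Evaluation: preprocessing statistics come from the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

# Final model: an unfitted copy of the same pipeline, fitted on all of X and y.
final_model = clone(model).fit(X, y)
```

In this sketch the held-out score is computed before the final refit, so it is not contaminated by the test rows; the final model is simply not evaluated on data it has already seen.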

0 Answers