Standardization on training only or also including testing data?

Question

My question is very much related to this one:

How to apply standardization/normalization to train- and testset if prediction is the goal?

However, my test data are not a single observation that I want to predict on, but rather a set of new observations. My training set has about 300 observations and my test set about 30. Using method 2 from the above link therefore does not seem like a good idea, as it would introduce bias: a sample of 30 is not large enough to give a good representation of the data distribution.

In that case, it seems to me that method 1 is preferable to method 3. Or am I missing something?

Frans Rodenburg · Answer 1 · 2019-11-13T02:59:50.523

When you train a model, you should consider any standardization/scaling to be part of the model training and thus use the estimates from the train set (e.g. sample average, standard deviation).
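As a rough illustration of what "part of the model training" means, here is a minimal sketch for a single feature. The data are simulated, with sizes chosen to match the question (300 train, 30 test), and the helper names `fit_scaler` and `apply_scaler` are made up for this example rather than taken from any library.

```python
import random
import statistics

def fit_scaler(train_values):
    """Estimate the scaling parameters from the training set only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def apply_scaler(values, mean, sd):
    """Standardize any data set with previously estimated parameters."""
    return [(v - mean) / sd for v in values]

# Simulated data (an assumption for illustration): 300 train, 30 test observations.
rng = random.Random(0)
train = [rng.gauss(10.0, 2.0) for _ in range(300)]
test = [rng.gauss(10.0, 2.0) for _ in range(30)]

# The mean and standard deviation come from the train set alone...
mean, sd = fit_scaler(train)

# ...and the same two numbers are reused for the test set.
train_scaled = apply_scaler(train, mean, sd)
test_scaled = apply_scaler(test, mean, sd)
```

The scaled train set has mean 0 and standard deviation 1 by construction. The scaled test set generally will not, and that is expected: it is being measured on the scale the model was trained on.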

You should never use the combined train+test set during model training, because this leaks information from your 'unseen' data into the model. This includes scaling (however subtle the effect).

Even if for some reason your test set or future data is larger than the train set, you should still use the sample estimates of the train set of whatever statistic you need for scaling, because you trained a model to perform well when scaled to these estimates. (Of course, if your test set really is considerably larger than your train set, you may want to consider retraining the model.)

The only exception is normalization using a well-defined minimum and maximum: If your data are 8-bit color intensities ranging from $0$ to $255$, then past, current and all future data will have the same minimum and maximum. In such cases, you should neither use your train set, nor your test set to estimate the extremes, but go with the known values instead.
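A minimal sketch of that exception, assuming 8-bit intensities: the bounds are constants known in advance, so nothing is estimated from either data set. The names `PIXEL_MIN`, `PIXEL_MAX` and `normalize_pixels`, as well as the pixel values, are illustrative only.

```python
# Known bounds for 8-bit color intensities; not estimated from any data.
PIXEL_MIN = 0
PIXEL_MAX = 255

def normalize_pixels(values):
    """Min-max normalize to [0, 1] using the fixed, known range."""
    return [(v - PIXEL_MIN) / (PIXEL_MAX - PIXEL_MIN) for v in values]

# Made-up example values: the observed extremes of each set are irrelevant.
train_pixels = [12, 40, 200]
test_pixels = [0, 128, 255]

train_norm = normalize_pixels(train_pixels)
test_norm = normalize_pixels(test_pixels)
```

Note that the train set here never reaches 0 or 255, yet the test values still land inside [0, 1] because the known range is used instead of the observed one.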

+1 In short: pretend the test data don’t exist when you train a model. — Dave, Nov 13 '19 at 03:03
