What is the correct way of standardizing data when there are training, validation and test sets?

Question

When standardizing data before training a neural network, say by subtracting the mean and then dividing by the standard deviation (sd) for each variable, there are several ways to go about it, and I am not clear which one is correct or best:

Compute the mean and sd of the training, validation and test sets separately, and standardize each set with its own parameters
Compute the mean and sd of the combined training and validation sets and apply them to both of those sets. Standardize the test set separately with its own mean and sd
Compute the mean and sd of the combined training and validation sets and apply them to all three sets
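
To make the three options concrete, here is a minimal NumPy sketch. The arrays `X_train`, `X_val` and `X_test` are randomly generated stand-ins, and the helpers `fit_standardizer` and `apply_standardizer` are hypothetical names introduced only for illustration. The sketch shows what each option computes; it does not say which one is correct.

```python
import numpy as np

def fit_standardizer(X):
    """Return the per-column mean and standard deviation of X."""
    return X.mean(axis=0), X.std(axis=0)

def apply_standardizer(X, mean, sd):
    """Subtract the given mean and divide by the given sd, column by column."""
    return (X - mean) / sd

# Hypothetical data: rows are samples, columns are variables.
rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(600, 3))
X_val = rng.normal(5.0, 2.0, size=(200, 3))
X_test = rng.normal(5.0, 2.0, size=(200, 3))
splits = (X_train, X_val, X_test)

# Option 1: every set is standardized with its own mean and sd.
option1 = tuple(apply_standardizer(X, *fit_standardizer(X)) for X in splits)

# Parameters from the combined training and validation sets (options 2 and 3).
mean_tv, sd_tv = fit_standardizer(np.vstack([X_train, X_val]))

# Option 2: train and val share parameters, test uses its own.
option2 = (
    apply_standardizer(X_train, mean_tv, sd_tv),
    apply_standardizer(X_val, mean_tv, sd_tv),
    apply_standardizer(X_test, *fit_standardizer(X_test)),
)

# Option 3: the train+val parameters are applied to all three sets.
option3 = tuple(apply_standardizer(X, mean_tv, sd_tv) for X in splits)
```

Under options 1 and 2 the standardized test set has exactly zero mean and unit sd per column by construction, whereas under option 3 it generally does not, because its parameters come from other data.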

Clearly, one cannot get the mean and sd for the combination of all three sets because there should be no information leak from the test set into the training procedure. What is the right way to go about this and why?

Also, in a regression problem, should the targets be standardized too?

I think standardizing the *entire* dataset would make the most sense. Standardization is not part of the core model building, so the information leak (from the test set) should not be a concern here. I'd be interested to know what you/others think. — Vishal, Apr 09 '16 at 17:33
@Vishal. Do you mean standardizing the entire dataset with the same parameters? The issue then becomes how you would treat completely new data that you have not seen before: which parameters would you use to standardize those? To me, a test set should be treated exactly like unknown data. Furthermore, I have now talked myself into yet another option: use the training set parameters to standardize all 3 sets, because val set performance should serve as an unbiased estimator of test set performance. I know this whole question may seem largely irrelevant for practical purposes, but there should be a right way of doing it. — salvador, Apr 09 '16 at 17:39
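
The additional option raised in the last comment (training set parameters applied to all three sets, and to any later unseen data) could be sketched as follows. This is only an illustration under assumed names: the arrays are randomly generated, and `standardize` and `x_new` are hypothetical.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are variables.
rng = np.random.default_rng(1)
X_train = rng.normal(5.0, 2.0, size=(600, 3))
X_val = rng.normal(5.0, 2.0, size=(200, 3))
X_test = rng.normal(5.0, 2.0, size=(200, 3))

# Parameters are estimated from the training set only.
train_mean = X_train.mean(axis=0)
train_sd = X_train.std(axis=0)

def standardize(X):
    """Standardize X with the stored training-set mean and sd."""
    return (X - train_mean) / train_sd

# The same stored parameters are reused for every split.
X_train_s = standardize(X_train)
X_val_s = standardize(X_val)
X_test_s = standardize(X_test)

# A completely new observation is handled exactly like the test set.
x_new = rng.normal(5.0, 2.0, size=(1, 3))
x_new_s = standardize(x_new)
```

With this scheme the validation set, the test set and new data all go through the same transformation, so none of them contributes to the parameters.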

0 Answers