Standardize Training and Validation Data

Question

I am supposed to standardize a training and a validation set "so that the training set has zero mean and unit $l_2$-norm". In order to do so I use the data.Normalization function from R's clusterSim package,

    normalizedtrainingset <- data.Normalization(trainingdata,type="n12",normalization="column")

which does do the trick for the training set. Now I am a bit confused about how to proceed with the validation set. I surely cannot use the same function because it would use the validation set's mean and sd and not the mean/sd from the training data. Thus, I proceeded like this:

    validation.standardized <- (validation-mean(training))/sd(training)

This takes into account the mean and sd of the training set. However, the values in the validation set are still quite a bit larger than those of the training data because the validation set has not been normalized. My question now is: do I divide the validation set by its own $l_2$-norm, do I divide it by the $l_2$-norm of the training set, or do I not divide it at all, so that the values in the validation set remain larger than those in the training data (the latter seems unlikely)?

score 2 · Accepted Answer · answered Nov 29 '16 at 11:13

Standardising with the training mean and variance is the correct approach: any transformation (including standardisation/normalisation) is part of the model-building process, so its parameters should be estimated after the data are split, from the training set only. With large samples the difference may be negligible, but in general one should not assume so. See also the answers to the following questions:

Perform feature normalization before or within model validation?

Normalization prior to cross-validation

Maximilian Aigner

With regard to the standardization this makes sense. But what about the normalization? All variables and the response of the training set were divided by their $l_2$-norm so that they have unit $l_2$-norm. What do I do about the validation set? Do I divide all variables by the corresponding $l_2$-norm of the training set or do I use the $l_2$-norm of the validation set? – YukiJ Nov 29 '16 at 11:58
By 'variance', I meant the $l_2$-norm. You should do exactly the same: use the training set $l_2$-norm to divide the validation set. The validation columns will not have exactly unit norm, but it's the best you can do without 'cheating' by using data from the validation set. – Maximilian Aigner Nov 29 '16 at 12:00
Alright, sorry, I did not get that at first. Thank you for your answer! :-) – YukiJ Nov 29 '16 at 12:18
@YukiJ No worries! – Maximilian Aigner Nov 29 '16 at 13:02
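
The approach agreed on in the comments above can be sketched in base R. This is only a minimal illustration, not the poster's actual code: the toy matrices `training` and `validation` and the helper names `train_mean` and `train_norm` are assumptions made for the example, and the $l_2$-norm is taken per column after centering, which is what the "zero mean and unit $l_2$-norm" requirement suggests.

```r
# Hypothetical toy data; in practice `training` and `validation` would be
# numeric matrices (or data frames) with the same columns.
set.seed(1)
training   <- matrix(rnorm(50 * 3, mean = 5, sd = 2), ncol = 3)
validation <- matrix(rnorm(20 * 3, mean = 5, sd = 2), ncol = 3)

# Column statistics estimated from the training set only.
train_mean     <- colMeans(training)
train_centered <- sweep(training, 2, train_mean, "-")
train_norm     <- sqrt(colSums(train_centered^2))  # l2-norm of each centered column

# Training set: zero mean and unit l2-norm in every column.
training_std <- sweep(train_centered, 2, train_norm, "/")

# Validation set: reuse the training mean and the training norm.
validation_std <- sweep(sweep(validation, 2, train_mean, "-"), 2, train_norm, "/")

colMeans(training_std)           # approximately 0 for every column
sqrt(colSums(training_std^2))    # 1 for every column
sqrt(colSums(validation_std^2))  # generally not 1
```

With this scaling the individual validation values end up on the same scale as the training values, even though the validation columns do not have unit norm. The same result should be obtainable with `scale(validation, center = train_mean, scale = train_norm)`, since base R's `scale()` accepts numeric vectors for both arguments.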
