  1. Do I transform all my data (or all folds, if CV is applied) at the same time? e.g.

    (allData - mean(allData)) / sd(allData)

  2. Do I transform the training set and the test set separately? e.g.

    (trainData - mean(trainData)) / sd(trainData)

    (testData - mean(testData)) / sd(testData)

  3. Or do I transform the training set and reuse its statistics on the test set? e.g.

    (trainData - mean(trainData)) / sd(trainData)

    (testData - mean(trainData)) / sd(trainData)

I believe 3 is the right way. If 3 is correct, do I have to worry about the test set's mean not being 0, or its range not lying within [0, 1] or [-1, 1] (in the case of normalization)?

  • Is there an elegant way to code this in `R`? See this question: https://stackoverflow.com/questions/49260862/wanted-trainable-standardscaler-for-r-similar-to-sklearn – Boern Mar 13 '18 at 16:26

1 Answer


The third way is correct. Exactly why is covered in wonderful detail in The Elements of Statistical Learning (see the section "The Wrong and Right Way to Do Cross-validation") and in the final chapter of Learning From Data, in the stock market example.
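
As a concrete illustration, here is a minimal sketch of the third procedure using scikit-learn's `StandardScaler` (the synthetic data and variable names are just placeholders for whatever you actually have):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Synthetic data standing in for your actual features.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
    X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

    # Fit the scaler on the training data only ...
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)

    # ... then reuse the training mean and standard deviation on the test data.
    X_test_scaled = scaler.transform(X_test)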

Essentially, procedures 1 and 2 leak information from your hold-out data set, either about the response or from the future, into the training or evaluation of your model. This can cause considerable optimism bias in your model evaluation.
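
If cross-validation is involved, the same rule applies inside every fold: the scaling parameters must be estimated from that fold's training portion only. One way to get this right automatically is a scikit-learn `Pipeline`, which refits the scaler within each fold; the Ridge regressor and synthetic data below are only illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative synthetic regression data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

    # The pipeline re-estimates the scaling parameters on the training part
    # of every fold, so no information from the held-out fold leaks in.
    model = make_pipeline(StandardScaler(), Ridge())
    print(cross_val_score(model, X, y, cv=5).mean())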

The idea in model validation is to mimic the situation you would be in when your model is making production decisions, where you do not have access to the true response. The consequence is that you cannot use the response in the test set for anything except comparing it to your predicted values.

Another way to approach it is to imagine that you only have access to one data point from your hold-out set at a time (a common situation for production models). Anything you cannot do under this assumption you should hold in great suspicion. Clearly, one thing you cannot do is aggregate over all new data points, past and future, to normalize your production stream of data, so doing the same for model validation is invalid.
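
To make the one-row-at-a-time picture concrete (continuing the sketch above; the incoming observation is made up): in production you keep only the statistics computed from the training data and apply them to each new row as it arrives.

    # Statistics computed once, from the training data only.
    train_mean = X_train.mean(axis=0)
    train_std = X_train.std(axis=0)

    # A single new observation arriving in production: it can only be scaled
    # with the stored training statistics, since future data is unavailable.
    new_point = np.array([4.2, 6.1, 5.5])
    new_point_scaled = (new_point - train_mean) / train_std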

You don't have to worry about the mean of your test set being non-zero; that is a better situation to be in than biasing your hold-out performance estimates. Though, of course, if the test set is truly drawn from the same underlying distribution as your training set (an essential assumption in statistical learning), that mean should come out as approximately zero.
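
A quick check of that last point, again continuing the sketch above: the test set scaled with the training statistics has per-feature mean near 0 and standard deviation near 1, but not exactly.

    # Near 0 and 1 respectively, but not exactly, since the training
    # statistics were used for scaling.
    print(X_test_scaled.mean(axis=0))
    print(X_test_scaled.std(axis=0))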

  • That's what I thought. Thank you for clarifying this! – DerTom Sep 30 '15 at 16:23
  • `Clearly, one thing you cannot do is aggregate over all new data-points past and future to normalize your production stream of data`. Why not? – Anmol Singh Jaggi May 07 '16 at 12:32
  • 1
    @AnmolSinghJaggi Its the "and future". If you haven't actually collected the data yet, you cannot normalize using it. – Matthew Drury May 07 '16 at 15:43
  • @MatthewDrury What about the validation set? When training the model, one would have a train, validation, and test set. In your post, I am not sure whether I should use the mean and std of the combined train and validation sets to normalize the test set, or the mean and std of the train set alone to normalize the validation and test sets. In my opinion, I should do the latter, because it also prevents leakage of info from the validation set into the model, but then the argument could be that the validation set is also part of the model, so there is no point in preventing that leakage. – user10024395 Jul 13 '16 at 01:02
  • @user2675516 Only train. – Matthew Drury Mar 06 '17 at 22:12
  • Nice answer! Unfortunately the first link appears to be down/broken. – Stefan Falk Sep 20 '17 at 07:12
  • I don't get it; in my view, the more data you take, the more faithful you are to the data's "true" shape. You can't "overfit" the distribution, can you? Does the way you prescribe yield better results in practical cases? And if so, does it really matter at all unless your data set is very small (like in the hundreds)? – Moody_Mudskipper Oct 30 '17 at 08:44
  • Why is the second way not right? It only uses information from the predictor variables (X) and does not use any information from the response (Y). How does it leak information from the future? – floodking Nov 01 '17 at 18:13
  • 6
    @floodking If you think of training data as "past" data, and testing data as "current or future", by aggregating across your test data you implicitly use information about the *future* of X. Data leakage is not just about leaking $y$ into your predictors, it is also about leaking information from the future. A good rule of thumb is that you should be able to make predictions using *only one row* or your testing data, otherwise you are using the future. – Matthew Drury Nov 01 '17 at 18:17
  • @Moody_Mudskipper Yes, in practice, evaluating your model honestly does produce better production results. There are many examples in the literature of procedures 1 or 2 producing models that look good during testing and then are disasters in production. – Matthew Drury Nov 01 '17 at 18:19
  • 1
    @MatthewDrury. Thanks for your clear explanation. I agree with you now. Only the third way is correct. – floodking Nov 02 '17 at 18:13