Imputing the mean value from the 'train set' into the 'test set'

Question

I have looked at a couple of similar questions and answers, and the recommendation seems to be to impute missing values in my 'test set' using the mean from the 'training set'.

However, what I am trying to find out is:

Why would you use the mean value from your 'train set' and not the mean values from your 'test set'?

What is the theory behind this? What would be the effect of using the test mean instead of the recommended approach, which is imputing from the train set?

Thank you for the help!

sww · Accepted Answer · 2018-05-05T04:14:29.500

1

You assume that train and test come from the same distribution, and hence you use the mean of the train set. Also, the training set is generally larger, which gives a better estimate of the mean of the distribution. Training is mostly done offline with a lot of data, and new test examples can then use the estimated mean.

If you use the mean from your test set, you are linking the performance of your procedure to the mean of the new data you are evaluating on. You could be testing on 1 example at a time, or 10, or hundreds, so the test mean would only be close to the training mean if you have a lot of elements in the test set. Also, suppose for example that income is in the range 100-200 in your train set, and the test set is made by an adversary who asks for predictions at 1000-1200, which does not make sense in your current model. If you use the mean from the test set, the valid test points will be affected by these outliers. But as the training data is large, the effect of outliers on the training mean will tend to get averaged out.
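To make the income example concrete, here is a minimal sketch using only the Python standard library. The numbers, the use of `None` for a missing value, and the helper names `impute` and `batch_mean` are all assumptions made up for illustration; they are not from the question.

```python
from statistics import mean

def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]

def batch_mean(values):
    """Mean of the observed (non-missing) entries."""
    return mean(v for v in values if v is not None)

# Hypothetical incomes in the 100-200 range; None marks a missing value.
train = [100, 120, 150, None, 180, 200]
train_mean = batch_mean(train)  # learned once, from training data only

# Two test batches with the same missing point: one ordinary,
# one that also contains out-of-range values (1000 and 1200).
test_clean = [110, None, 190]
test_shifted = [110, None, 190, 1000, 1200]

# Train mean: the missing point gets the same fill in both batches.
fill_train_clean = impute(test_clean, train_mean)[1]
fill_train_shifted = impute(test_shifted, train_mean)[1]

# Test mean: the fill depends on whatever else happens to be in the batch.
fill_test_clean = impute(test_clean, batch_mean(test_clean))[1]
fill_test_shifted = impute(test_shifted, batch_mean(test_shifted))[1]

print(fill_train_clean, fill_train_shifted)  # 150 150
print(fill_test_clean, fill_test_shifted)    # 150 625
```

With the train mean, the same missing point is always filled with 150. With the test mean, it is filled with 150 in one batch and 625 in the other, only because of the outliers that share its batch.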

edited May 05 '18 at 04:14

answered May 05 '18 at 04:02


What would be the consequences of using the mean from the test set to fill missing values? How could this negatively affect my results? Thank you for the help! – Built13 May 05 '18 at 04:09
Hello! I updated my answer for your question. Please let me know if you have any other questions. – sww May 05 '18 at 04:14
1

So could you say that if your new data set (test data) has very different values from what the model was trained on, it would affect the accuracy of the results? So it is based on the assumption that the data passed to the model is mostly similar to the data used in the training set. I think this makes sense now. Thank you for the big help! – Built13 May 05 '18 at 04:31

score 0 · Answer 2 · answered May 05 '18 at 05:30

The purpose of a test set is to get an estimate of how your model will perform in the real world, using data that has played no part in your model building.

It is important to realise that the mean becomes part of the model you want to test.

The reason for this is that once a model is deployed:

1. You do not want to apply the model in batches, which you would have to do to get an updated mean.
2. After mean centring, the model describes how the sales deviate from the mean. Changing the mean changes the baseline and therefore renders the model outdated.
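One way to picture "the mean is part of the model" is a small object that learns the mean once and then applies it to single examples. This is only a sketch: the class `MeanImputer`, its methods, and the sales figures are hypothetical names and values chosen for illustration. As far as I know, scikit-learn's `SimpleImputer` follows a similar fit-then-transform pattern.

```python
from statistics import mean

class MeanImputer:
    """Hypothetical helper: learns a mean in fit() and reuses it afterwards."""

    def __init__(self):
        self.mean_ = None  # set by fit(); stored alongside the model

    def fit(self, values):
        # Estimate the mean from observed training values only.
        self.mean_ = mean(v for v in values if v is not None)
        return self

    def impute_one(self, value):
        # Works on a single example, so no batch is needed at deployment.
        if self.mean_ is None:
            raise RuntimeError("call fit() on training data first")
        return self.mean_ if value is None else value

    def centre_one(self, value):
        # Deviation from the training baseline, not from the current batch.
        return self.impute_one(value) - self.mean_

# Made-up training sales; None marks a missing value.
imputer = MeanImputer().fit([10.0, 12.0, None, 14.0])

filled = imputer.impute_one(None)    # 12.0, the stored training mean
centred = imputer.centre_one(15.0)   # 3.0 above the training baseline
```

Because the stored mean never changes after `fit`, every new example is measured against the same baseline the model was trained on.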
