
I have a data set with N ~ 5000 and about 1/2 missing on at least one important variable. The main analytic method will be Cox proportional hazards.

I plan to use multiple imputation. I will also be splitting into a train and test set.

Should I split the data and then impute separately, or impute and then split?

If it matters, I will be using PROC MI in SAS.

Peter Flom
  • 50% missing values for a crucial variable? Ugh. Rather than impute, why not create a 'Missing' category for the variable? – RobertF Apr 24 '14 at 19:39
  • No one variable has 50% missing, but about 50% is missing on at least one. Also, they are continuous, so "missing" would mess things up. – Peter Flom Apr 24 '14 at 20:05
  • Ah. I get nervous using imputation. I wonder about the merits of having a continuous variable with 50% of its values imputed vs. converting the continuous variable to categorical with a 'Missing' category plus enough bins to capture the behavior of the non-missing values? – RobertF Apr 24 '14 at 20:10
  • I don't like binning continuous variables. – Peter Flom Apr 24 '14 at 20:15

3 Answers


You should split before pre-processing or imputing.

The division between training and test sets is an attempt to replicate the situation where you have past information and are building a model that you will test on future, as-yet-unknown information: the training set takes the place of the past and the test set takes the place of the future, so you only get to test your trained model once.

Keeping the past/future analogy in mind, this means that anything you do to pre-process or process your data, such as imputing missing values, should be done on the training set alone. Record what you did to the training set so that, if your test set also needs pre-processing or imputing, you can do it the same way on both sets.

Added from comments: if you use the test data to affect the training data, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place.
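To make the "do it the same way on both sets" point concrete, here is a minimal Python/scikit-learn sketch rather than the PROC MI workflow from the question: simple mean imputation stands in for whatever imputation method you actually use, and `X`, `y` are hypothetical placeholder data. The imputer is fitted on the training set only, and the fitted training-set statistics are then reused on the test set.

```python
# Minimal sketch: fit the imputation on the training set, reuse it on the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Hypothetical placeholder data with some missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# 1. Split first, so the test set never influences anything you fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the imputation (here: column means) from the training set only.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# 3. Apply the *same* fitted imputer to the test set: the training-set means
#    are reused, so no information flows from the test set into the model.
X_test_imputed = imputer.transform(X_test)
```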

Henry
  • When you say "you do it the same way on both sets", do you mean: "use the same method to impute missing data in the test set, but NOT the same data"? – colorlace May 05 '18 at 04:22
  • @colorlace Use the past/future analogy. You used the training set in the past, and imputed some values. You now get the test set in the future, and want to impute some of its values; you presumably will use the same method as before applied to the test data (though you are free to incorporate what you learned from the training data) – Henry May 05 '18 at 14:16
  • If you "are free to incorporate what you learned from the training data", then how is that different from just **not** splitting before imputing. – colorlace May 07 '18 at 22:51
  • Are you suggesting: you can inform your test set imputations with training data, but you can't inform training imputations with test data? – colorlace May 07 '18 at 22:54
  • @colorlace: that final point is precisely what I am saying: nothing you do with the training data should be informed by the test data (the analogy is that the future should not affect the past), but what you do do with the test data can be informed by the training data (the analogy is that you can use the past to help predict the future) – Henry May 08 '18 at 19:05
  • I see. So, if we're assuming that the predictor variables are coming from the same distribution, why not just, upon getting a new batch of data to predict on, recompute the imputations [using all available data]? – colorlace May 08 '18 at 19:10
  • @colorlace - if you use the test data to affect the training data, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place – Henry May 08 '18 at 21:56
  • Henry, your last comment should be the first two sentences of your answer. :) – Alexis May 22 '18 at 22:41
  • @Henry I agree you ARE using test data to affect training data (which in turn affects the model). But in this case one could conceivably continue to incorporate all the future data into the imputation calculation - incrementally updating the training data (and thus the model) as we get more data about the distribution of the input vars. – colorlace May 29 '18 at 14:06
  • @colorlace - that only makes sense if you have more new test data you have not seen before, which you might then use to test the revised model. Once you incorporate the original test data into the model, it ceases to be test data and it becomes training data, and so cannot be used for testing. – Henry May 29 '18 at 14:20

I think you'd better split before you do imputation. For instance, you may want to impute missing values with the column mean. In this case, if you impute first with the train+valid data set and split next, then you have used the validation data set before building your model, which is how a data leakage problem comes into the picture.

But you might ask: if I impute after splitting, won't it be too tedious when I need to do cross-validation? My suggestion for that is to use an sklearn Pipeline. It really simplifies your code and reduces the chance of making a mistake. See Pipeline.
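As a rough illustration of that suggestion (with made-up data, and LogisticRegression standing in for whatever model you actually fit, since the question's Cox model is not in scikit-learn): putting the imputer inside the Pipeline means each cross-validation split re-fits the imputation on its own training folds only, so the held-out fold never leaks into it.

```python
# Rough sketch: imputation inside a Pipeline so cross-validation stays leak-free.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up data with some missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan   # sprinkle in missing values
y = rng.integers(0, 2, size=100)

# Bundling the imputer with the model means every cross-validation split
# re-fits the imputer on that split's training folds only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```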

Michael R. Chernick
cc458

Just to add to the above, I would also favour splitting before imputing or any other type of pre-processing. Nothing you do with the training data should be informed by the test data (the analogy is that the future should not affect the past). You can then record what you did to your training set so that, if your test set also needs pre-processing or imputing, you do it the same way on both sets (the analogy is that you can use the past to help predict the future).

If you use the test data to affect the training data in any way, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place!

I think the caret package in R is very useful in this setting. In particular, I found this post extremely helpful: https://topepo.github.io/caret/model-training-and-tuning.html

ALEX.VAMVAS