22

Multiple imputation is fairly straightforward when you have an a priori linear model that you want to estimate. However, things seem to be a bit trickier when you actually want to do some model selection (e.g., finding the "best" set of predictor variables from a larger set of candidates; I am thinking specifically of LASSO and fractional polynomials, using R).

One idea would be to perform model selection on the original data with missing values, and then re-estimate the selected model in the MI datasets and combine the estimates as you normally would. However, this seems problematic, since you are expecting bias (or else why do MI in the first place?), which could lead to selecting a "wrong" model from the start.

Another idea would be to go through whatever model selection process you are using in each MI dataset - but how would you then combine results if they include different sets of variables?

One thought I had was to stack the MI datasets and analyze them as one large dataset, which you would then use to fit a single "best" model, including a random effect to account for the fact that you have repeated measures on each observation.

Does this sound reasonable? Or perhaps incredibly naive? Any pointers on this issue (model selection with multiple imputation) would be greatly appreciated.

D L Dahly
  • Please edit this post to change "model fitting" to "model selection". It would also be helpful to discuss which method you're using. For instance, if stepwise model selection based on p-values is used, then stacking imputed data is absolutely NOT allowed. You can draw bootstrap resamples of your data (including the missing data), apply MI and the subsequent model selection process to each resample, and calculate an exact "p-value" for the selected model. – AdamO Dec 30 '12 at 23:41
  • In your second paragraph, why do you think that method misses the point of multiple imputation? Also, what software are you using? – Peter Flom Dec 30 '12 at 23:42

4 Answers

11

There are many things you could do to select variables from multiply imputed data, but not all yield appropriate estimates. See Wood et al (2008) Stat Med for a comparison of various possibilities.

I have found the following two-step procedure useful in practice.

  1. Apply your preferred variable selection method independently to each of the $m$ imputed data sets. You will end up with $m$ different models. For each variable, count the number of times it appears in the model. Select those variables that appear in at least half of the $m$ models.
  2. Use the p-value of the Wald statistic or of the likelihood ratio test as calculated from the $m$ multiply-imputed data sets as the criterion for further stepwise model selection.

The pre-selection in step 1 is included to reduce the amount of computation. See https://stefvanbuuren.name/fimd/sec-stepwise.html (section 5.4.2) for a code example of the two-step method in R using mice(). In Stata, you can perform Step 2 (on all variables) with mim:stepwise.
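Below is a minimal sketch of this two-step idea in R, using the `nhanes2` data shipped with `mice`. It is an illustration rather than the exact code from the book: step 1 uses AIC-based backward selection via `MASS::stepAIC()` (substitute your preferred selection method), and step 2 uses the pooled Wald test `D1()` from recent versions of `mice` (older versions offer `pool.compare()`).

```
library(mice)   # imputation, complete(), pooled Wald test D1()
library(MASS)   # stepAIC() for per-dataset backward selection

imp <- mice(nhanes2, m = 20, printFlag = FALSE, seed = 1)

# Step 1: run the selection method separately in each imputed dataset and
# count how often each term is retained; keep terms chosen in >= m/2 models.
kept <- lapply(seq_len(imp$m), function(i) {
  fit <- stepAIC(lm(chl ~ age + bmi + hyp, data = complete(imp, i)),
                 trace = FALSE)
  attr(terms(fit), "term.labels")
})
counts <- table(unlist(kept))
counts[counts >= imp$m / 2]    # pre-selected variables entering step 2

# Step 2: decide on further deletions with the Wald test pooled over the m
# imputed datasets, e.g. testing whether 'hyp' can be dropped.
fit1 <- with(imp, lm(chl ~ age + bmi + hyp))
fit0 <- with(imp, lm(chl ~ age + bmi))
D1(fit1, fit0)
```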

Stef van Buuren
  • The proposed routine may make sense only when you select from a pre-specified set of regressors. But if I choose, say, a quadratic trend, 5- and 9-knot B-splines, and maybe a CART, I am not sure how to apply this proposal. – StasK Jan 01 '13 at 23:14
  • Stas, the procedure assumes that the imputation model is correct. In particular the imputation method must adequately capture all features in the data in which you might be interested later on. So if you want to include quadratic terms or B-splines into your complete-data analysis, then the imputation model should be set up in such a way that those features are preserved in the imputed data (Note: this may actually be difficult to achieve, but that's a topic on its own). Given that the imputation model is correctly specified, I would say that the two-step selection procedure applies. – Stef van Buuren Jan 02 '13 at 00:23
  • Well, then basically the imputation model must be the richest possible model. I have come across situations when that does not quite work out, like perfect predictions in overparameterized logistic models. – StasK Jan 02 '13 at 06:06
  • Agreed. You'll have to impute under the richest possible model. So, first define the most complex analyses that you would like to do, and tailor the imputation model to that. This might be difficult to achieve in practice, and becomes harder as the complexity of the complete-data model grows. There's no free lunch. Perfect prediction in logistic regression has been solved in a number of ways, and does not need to present a major stumbling block. – Stef van Buuren Jan 02 '13 at 10:23
  • I know that Patrick Royston's `ice` does something ad hoc (augments the data with artificial observations) to avoid it, and there's also Firth's logistic regression with the Jeffreys prior. What other methods are there? – StasK Jan 02 '13 at 14:13
  • Section 3.5.2 of my book Flexible Imputation of Missing Data lists six methods, including the two you mentioned. The augmentation method of White et al (2010) CSDA seems to be most widely adopted. – Stef van Buuren Jan 02 '13 at 15:55
  • See `pool.compare()` in `mice`. – crsh Oct 23 '13 at 10:38
  • A copy of Wood 2008 for those who don't have paywall access: https://dl.dropboxusercontent.com/u/280585369/2008-wood.pdf ; Wood finds a pooling of steps through Rubin's rules to be best, and http://www.biostat.wisc.edu/Tech-Reports/pdf/tr_217.pdf evaluates that as a 'MI-stepwise' function. – gwern Apr 29 '14 at 21:48
  • "Use the p-value of the Wald statistic or of the likelihood ratio test as calculated from the mm multiply-imputed data sets as the criterion for further stepwise model selection." I would not recommend this. in "Linear Models with R" (Faraway, 2004) he writes "the p-values used should not be treated too literally. There is so much multiple testing occurring (during model selection) that the validity is dubious. the removal of less significant predictors tends to increase the significance of the remaining predictors." – Alejandro Ochoa Jan 23 '17 at 16:05
4

It is straightforward: you can apply the standard MI combining rules, but the effects of variables that are not selected in every imputed dataset will be attenuated. For example, if a variable is not selected in a particular imputed dataset, its estimate (including its variance) is zero in that dataset, and this has to be reflected in the estimates that are pooled under multiple imputation. You can use bootstrapping to construct confidence intervals that incorporate the model selection uncertainty; have a look at this recent publication, which addresses all of these questions: http://www.sciencedirect.com/science/article/pii/S016794731300073X
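As a minimal sketch of that combining rule (my own illustration, not code from the paper), here is Rubin's rules applied to per-imputation estimates in which a variable's estimate and standard error are set to zero in the datasets where it was not selected:

```
# Pool a single coefficient over m imputations with Rubin's rules, treating
# the estimate and its standard error as 0 where the variable was dropped.
pool_with_zeros <- function(est, se) {
  m       <- length(est)
  qbar    <- mean(est)              # pooled point estimate
  ubar    <- mean(se^2)             # within-imputation variance
  b       <- var(est)               # between-imputation variance
  t_total <- ubar + (1 + 1/m) * b   # total variance (Rubin's rules)
  c(estimate = qbar, se = sqrt(t_total))
}

# Example: a variable selected (estimate ~0.8, SE 0.3) in 3 of 5 imputations
pool_with_zeros(est = c(0.8, 0.7, 0, 0.9, 0),
                se  = c(0.3, 0.3, 0, 0.3, 0))
```

The shrinkage of the pooled estimate toward zero reflects exactly the attenuation described above; bootstrapping is then layered on top of this to obtain confidence intervals that account for the selection step.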

I would avoid pragmatic approaches such as selecting a variable only if it is chosen in m/2 of the datasets, or something similar, because the resulting inference is unclear and more complicated than it looks at first glance.

Michael
3

I was having the same problem.

My choice was the so-called "multiple imputation LASSO" (MI-LASSO). Basically, it stacks all the imputed datasets together and adopts the idea of the group lasso: every candidate variable is represented by $m$ copies, one for each imputed dataset.

The $m$ copies of a variable are then treated as a single group, so you either discard a candidate variable's $m$ copies in all imputed datasets or keep them in all imputed datasets.

The lasso regression is therefore fit on all imputed datasets jointly.
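The following is a rough sketch of that grouping idea using the `grpreg` package; it is not the authors' MI-LASSO algorithm (which has its own estimation procedure and accompanying R program). It assumes a `mice()` object `imp`, an outcome column named `y`, and numeric candidate predictors:

```
library(mice)
library(grpreg)

m     <- imp$m
dat1  <- complete(imp, 1)
preds <- setdiff(names(dat1), "y")   # candidate predictors (assumed numeric)
p     <- length(preds)
n     <- nrow(dat1)

# Stack the m completed datasets into a block design: each candidate variable
# gets m columns (one per imputation), so its m coefficients form one group.
X  <- matrix(0, n * m, p * m)
yy <- numeric(n * m)
for (j in seq_len(m)) {
  dat  <- complete(imp, j)
  rows <- (j - 1) * n + seq_len(n)
  for (k in seq_len(p)) {
    X[rows, (k - 1) * m + j] <- dat[[preds[k]]]
  }
  yy[rows] <- dat$y
}

grp   <- rep(seq_len(p), each = m)                   # one group per variable
cvfit <- cv.grpreg(X, yy, group = grp, penalty = "grLasso")

# A variable is selected iff its whole group of m coefficients is nonzero.
beta     <- coef(cvfit)[-1]                          # drop the intercept
selected <- preds[tapply(beta != 0, grp, any)]
selected
```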

Check the paper:

Chen, Q. & Wang, S. (2013). "Variable selection for multiply-imputed data with application to dioxin exposure study," Statistics in Medicine, 32:3646-59.

And a relevant R program

Fan Wang
1

I've been facing a similar problem -- I've got a dataset in which I knew from the start that I wanted to include all variables (I was interested in the coefficients more than the prediction), but I didn't know a priori what interactions should be specified.

My approach was to write out a set of candidate models, perform the multiple imputations, estimate each candidate model in each imputed dataset, and simply save and average the AICs for each model across imputations. The model specification with the lowest average AIC was selected.
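A small R sketch of this averaged-AIC comparison (my reading of the answer, with hypothetical variable names `y`, `x1`, `x2`, `x3` and a `mice()` object `imp`):

```
library(mice)

# Pre-specified candidate models, differing only in which interactions enter.
candidates <- list(
  main_effects = y ~ x1 + x2 + x3,
  x1_by_x2     = y ~ x1 * x2 + x3,
  all_two_way  = y ~ (x1 + x2 + x3)^2
)

# Fit every candidate in every imputed dataset and average the AICs.
mean_aic <- sapply(candidates, function(f) {
  mean(sapply(seq_len(imp$m),
              function(i) AIC(lm(f, data = complete(imp, i)))))
})
sort(mean_aic)   # the specification with the lowest average AIC is chosen
```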

I thought about adding a correction that penalizes between-imputation variance in AIC, but on reflection this seemed pointless.

The approach seemed straightforward enough to me, but I invented it myself, and I'm no celebrated statistician. Before using it, you may wish to wait until people either correct me (which would be welcome!) or upvote this answer.

generic_user
  • Thanks for the reply. Unfortunately what I'm really interested in is using more automated/exploratory methods of model selection that don't lend themselves to first selecting a reasonable set of candidate models. – D L Dahly Jan 01 '13 at 10:30