Consider a dataset of 81 observations with 4 measured independent variables and one dependent variable. In this dataset, some of the observations are missing values for some of the independent variables. There is no pattern to the missing values, but I'm left with only 50 rows of complete data.
In R, such a dataset can be generated as follows:
set.seed(1001)
lev <- c(-1, 0, 1)
df <- expand.grid(X1 = lev, X2 = lev, X3 = lev, X4 = lev)  # full 3^4 factorial, 81 rows
df$Y <- df$X1 + 3*df$X2 - 0.5*df$X3 + rnorm(81, 0, 1)      # X4 has no true effect
# Knock out 10 values at random in each predictor
df$X1[sample(1:81, 10, replace = FALSE)] <- NA
df$X2[sample(1:81, 10, replace = FALSE)] <- NA
df$X3[sample(1:81, 10, replace = FALSE)] <- NA
df$X4[sample(1:81, 10, replace = FALSE)] <- NA
any(apply(is.na(df[, 1:4]), 1, all))   # is any row missing all four predictors?
[1] FALSE
sum(apply(!is.na(df[, 1:4]), 1, all))  # number of complete rows
[1] 50
If I were to fit a single regression model to this data I would use only the 50 complete rows. However, I wish to identify the 'best' model, so I plan to use all-subsets regression: regress the response on each variable alone and keep the best model, then fit every two-variable multiple regression and keep the best, and so on.
When I am calculating each of the regression equations, should I:
1. Use only the 50 complete data points?
2. Use the set of complete observations as determined by my current subset?
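To make the two options concrete, here is a minimal sketch on the simulated df above (the names vars, subsets, and fits are just for illustration). Note that lm()'s default na.action = na.omit already implements option 2, since it drops only the rows that are incomplete for the variables in the current formula:

vars <- c("X1", "X2", "X3", "X4")
complete <- complete.cases(df[, vars])  # the 50 globally complete rows

# Every non-empty subset of the four predictors
subsets <- unlist(lapply(1:4, function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)

fits <- lapply(subsets, function(s) {
  f <- reformulate(s, response = "Y")
  list(opt1 = lm(f, data = df[complete, ]),  # option 1: always the same 50 rows
       opt2 = lm(f, data = df))              # option 2: NAs dropped per subset
})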
It seems to me that if I'm going fishing I may as well use as many observations as are available to me - after all, if only X1 and X2 are meaningfully correlated with the response then I don't care (anymore) about X3 or X4. However, most software implementations opt for (1), and I don't know whether that is because it is computationally easier to extract the subset of complete observations once and then iterate over the columns, or because it is statistically more appropriate to use the same set of observations for every comparison.
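Continuing the sketch above, nobs() reports how many rows each candidate model actually uses; option 1 always uses the same 50, while option 2 recovers extra rows for the smaller subsets (for example, rows missing only X3 or X4 become usable for Y ~ X1 + X2):

data.frame(model = sapply(subsets, paste, collapse = " + "),
           n_complete_only = sapply(fits, function(m) nobs(m$opt1)),  # always 50
           n_per_subset    = sapply(fits, function(m) nobs(m$opt2)))  # >= 50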
EDIT
Additional information. Please assume that, if required, any of the variables could be consistently and reliably generated; it's just that, up to this point, they have been inconsistently recorded. I could, if desired, design and execute an experiment on all four variables, but I would rather not go through that hassle.
This brings up another point - I really don't care about prediction, I care about explanation. My 'best' model may be the one which minimizes VIFs.
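For what it's worth, if low collinearity is the criterion, the VIFs of any candidate fit can be computed directly; one common implementation is car::vif() (assuming the car package is installed), applied here to the full model as an example:

library(car)
vif(lm(Y ~ X1 + X2 + X3 + X4, data = df))  # one VIF per predictor; lm() has
                                           # already dropped the incomplete rows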
I understand that this approach may not result in me selecting the true 'best' model. I am really only interested in the question I posed: when taking this approach, should I use the complete cases for the whole dataset, or the complete cases for the subset currently being fit?