Consider a dataset of 81 observations with 4 measured independent variables and one dependent variable. In this dataset, some of the observations are missing values for some of the independent variables. There is no pattern to the missing values, but I'm left with only 50 rows of complete data.
In R, such a dataset can be generated as follows:
set.seed(1001)
lev <- c(-1, 0, 1)
df <- expand.grid(X1 = lev, X2 = lev, X3 = lev, X4 = lev)  # full 3^4 factorial, 81 rows
df$Y <- df$X1 + 3*df$X2 - 0.5*df$X3 + rnorm(81, 0, 1)      # X4 has no true effect
# Knock out 10 values at random in each predictor
df$X1[sample(1:81, 10, replace = FALSE)] <- NA
df$X2[sample(1:81, 10, replace = FALSE)] <- NA
df$X3[sample(1:81, 10, replace = FALSE)] <- NA
df$X4[sample(1:81, 10, replace = FALSE)] <- NA
any(apply(is.na(df[, 1:4]), 1, all))   # is any row missing all four predictors?
[1] FALSE
sum(apply(!is.na(df[, 1:4]), 1, all))  # number of complete rows
[1] 50
If I were to fit a single regression model to this data I would use only the 50 complete rows. However, I wish to identify the 'best' model, so I plan to use all-subsets regression: regress the response on each variable alone and keep the best model, then fit every two-variable multiple regression and keep the best, and so on.
When I am calculating each of the regression equations, should I:
1. Use only the 50 complete data points?
2. Use the set of complete observations as determined by my current subset?
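To make the two options concrete, here is a minimal sketch on the simulated df above (the names vars, subsets, and fits are just for illustration). Note that lm()'s default na.action = na.omit already implements option 2, since it drops only the rows that are incomplete for the variables in the current formula:

vars <- c("X1", "X2", "X3", "X4")
complete <- complete.cases(df[, vars])  # the 50 globally complete rows

# Every non-empty subset of the four predictors
subsets <- unlist(lapply(1:4, function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)

fits <- lapply(subsets, function(s) {
  f <- reformulate(s, response = "Y")
  list(opt1 = lm(f, data = df[complete, ]),  # option 1: always the same 50 rows
       opt2 = lm(f, data = df))              # option 2: NAs dropped per subset
})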
It seems to me that if I'm going fishing I may as well use as many observations as are available to me - after all, if only X1 and X2 are meaningfully correlated with the response then I don't care (anymore) about X3 or X4. However, most software implementations opt for (1), and I don't know whether that is because it is computationally easier to extract the subset of complete observations once and then iterate over the columns, or because it is statistically more appropriate to use the same set of observations for every comparison.
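Continuing the sketch above, nobs() reports how many rows each candidate model actually uses; option 1 always uses the same 50, while option 2 recovers extra rows for the smaller subsets (for example, rows missing only X3 or X4 become usable for Y ~ X1 + X2):

data.frame(model = sapply(subsets, paste, collapse = " + "),
           n_complete_only = sapply(fits, function(m) nobs(m$opt1)),  # always 50
           n_per_subset    = sapply(fits, function(m) nobs(m$opt2)))  # >= 50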
EDIT
Additional information. Please assume that, if required, any of the variables could be consistently and reliably generated; it's just that, up to this point, they have been inconsistently recorded. I could, if desired, design and execute an experiment on all four variables, but I would rather not go through that hassle.
This brings up another point - I really don't care about prediction, I care about explanation. My 'best' model may be the one which minimizes VIFs.
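For what it's worth, if low collinearity is the criterion, the VIFs of any candidate fit can be computed directly; one common implementation is car::vif() (assuming the car package is installed), applied here to the full model as an example:

library(car)
vif(lm(Y ~ X1 + X2 + X3 + X4, data = df))  # one VIF per predictor; lm() has
                                           # already dropped the incomplete rows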
I understand that this approach may not result in me selecting the true 'best' model. I am really only interested in the question I posed: when taking this approach, should I use the complete cases for the whole dataset, or the complete cases for the subset currently being fit?