I have a data set with many missing observations (NA values) for certain parameters. I have been performing model selection using AIC, and based on the AIC scores I have reduced the model to the form

y = a*b + c

where $a$, $b$, and $c$ are continuous independent (predictor) variables and $y$ is my dependent (response) variable. However, in this model $c$ is not significant, and if I remove $c$ I can use many more observations from the raw data (many of the missing values are in the $c$ column). Dropping the $c$ parameter and using the extra data, I find that both the model $R^2$ and the AIC score improve. At this point, however, I am comparing apples and oranges: the model with the $c$ parameter is evaluated on the same data set, but with 30 fewer observations.
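To make the issue concrete, here is a minimal sketch with simulated data (hypothetical, not my actual data set) showing the two comparisons: the AICs of the full and reduced models are on equal footing only when both are fit to the same complete-case subset, whereas the reduced model fit to all rows uses a different likelihood sample.

```python
# Illustration: AIC values are only comparable between models fit to the
# SAME set of observations. Simulated data with 30 missing values in c.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
df["y"] = 1 + df.a + df.b + 0.5 * df.a * df.b + 0.1 * df.c + rng.normal(size=n)
df.loc[rng.choice(n, 30, replace=False), "c"] = np.nan  # 30 missing c values

# Complete cases: only the rows where c is observed
cc = df.dropna()

full = smf.ols("y ~ a * b + c", data=cc).fit()       # n - 30 rows
reduced_same = smf.ols("y ~ a * b", data=cc).fit()   # same rows: AICs comparable
reduced_all = smf.ols("y ~ a * b", data=df).fit()    # all rows: AIC NOT comparable

print(full.nobs, reduced_same.nobs, reduced_all.nobs)
print(full.aic, reduced_same.aic, reduced_all.aic)
```

Comparing `full.aic` with `reduced_same.aic` is a legitimate nested-model comparison; comparing `full.aic` with `reduced_all.aic` is the apples-and-oranges comparison described above, because the two likelihoods are computed over different observations.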
My questions are:

1. Is this a legitimate reason to remove $c$ from the model? I don't think it is, but if it is, is there a reference for this?
2. Is there a valid way to compare model selection statistics across models that have access to different amounts of data? The different amounts of data are driven by the fact that there are many missing values in the data set.