I have a data set with many missing observations (NA values) for certain parameters. I have been performing model selection using AIC, and based on the AIC scores I have reduced the model to the form

y = a*b + c

where $a$, $b$, and $c$ are continuous independent (predictor) variables and $y$ is my dependent (response) variable. However, in this model $c$ is not significant, and if I remove $c$ I can use many more observations from the raw data (many of the missing values are in the $c$ column). Dropping the $c$ parameter and using the extra data, I find that both the model $R^2$ and the AIC score improve. At this point, however, I am comparing apples and oranges: the model with the $c$ parameter is evaluated on the same data set, but with 30 fewer observations.
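To make the issue concrete, here is a minimal sketch with simulated data (hypothetical, not my actual data set) showing the two comparisons: the AICs of the full and reduced models are on equal footing only when both are fit to the same complete-case subset, whereas the reduced model fit to all rows uses a different likelihood sample.

```python
# Illustration: AIC values are only comparable between models fit to the
# SAME set of observations. Simulated data with 30 missing values in c.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
df["y"] = 1 + df.a + df.b + 0.5 * df.a * df.b + 0.1 * df.c + rng.normal(size=n)
df.loc[rng.choice(n, 30, replace=False), "c"] = np.nan  # 30 missing c values

# Complete cases: only the rows where c is observed
cc = df.dropna()

full = smf.ols("y ~ a * b + c", data=cc).fit()       # n - 30 rows
reduced_same = smf.ols("y ~ a * b", data=cc).fit()   # same rows: AICs comparable
reduced_all = smf.ols("y ~ a * b", data=df).fit()    # all rows: AIC NOT comparable

print(full.nobs, reduced_same.nobs, reduced_all.nobs)
print(full.aic, reduced_same.aic, reduced_all.aic)
```

Comparing `full.aic` with `reduced_same.aic` is a legitimate nested-model comparison; comparing `full.aic` with `reduced_all.aic` is the apples-and-oranges comparison described above, because the two likelihoods are computed over different observations.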
My questions are:

1. Is this a legitimate reason to remove $c$ from the model? I don't think it is, but if it is, is there a reference for this?
2. Is there a valid way to compare model selection statistics across models that have access to different amounts of data? The different amounts of data are driven by the fact that there are many missing values in the data set.