I am following up on the post "Is it okay to compare fitted distributions with the AIC?" and have read the document "AIC Myths," but I am still not satisfied.
Context:
I'm working on a project in R in a setting where different distributional assumptions can reasonably be made about the response variable, and I'm in a debate about model comparison in this setting using AIC. Of course, other robust metrics will be used as well.
The study looks at the performance of students at various schools on a math exam.
The first fit we consider is a linear model with the percent of students who pass as the response. Of course, these percents are discrete insofar as a school with n students admits only n + 1 possible percentages, but since n is relatively large for each school, this discreteness is not a substantial issue (with n = 100, a prediction of 61.5% against a true 61% doesn't feel awful). In any case, it turns out that the linear model fits decently well: the diagnostic plots are relatively fine, with some issues at the tails as expected. Moreover, the covariates are such that no reasonable values they attain produce predictions outside 0-100%.
The second fit is, of course, a GLM (binomial family), with the number of students (out of n) at each school who pass as the response. We are again estimating the proportion of students who pass the exam.
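For concreteness, here is roughly the setup I have in mind, on simulated data (the covariate, the sample sizes, and the data-generating process here are all invented for illustration, not taken from the real study):

```r
## Hypothetical version of the two fits: lm on percent vs. binomial glm.
set.seed(1)
n_schools  <- 1000
n_students <- 100                       # students per school (assumed equal)
x <- rnorm(n_schools)                   # a single made-up covariate
p <- plogis(0.3 + 0.5 * x)              # true pass probability
passes <- rbinom(n_schools, n_students, p)
pct <- 100 * passes / n_students        # percent passing

fit_lm  <- lm(pct ~ x)
fit_glm <- glm(cbind(passes, n_students - passes) ~ x, family = binomial)

AIC(fit_lm)
AIC(fit_glm)

## Predictions on the percent scale agree closely even when the AICs
## are far apart, mirroring the situation described above.
pred_lm  <- predict(fit_lm)
pred_glm <- 100 * predict(fit_glm, type = "response")
mean(abs(pred_lm - pred_glm))
```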
So, my partner told me to look at AIC to compare these, and AIC still bothers me; hence this post. For the linear model, the AIC was nearly 1500 lower than for the GLM, which feels weird. At the end of the day, their predictions are nearly identical: over nearly 1000 data points, the mean absolute difference in the predicted percent (measured as 100 * proportion) is 1.18, with a standard deviation below 1. And distributionally, the binomial assumption is arguably the more reasonable one.
I am aware that the absolute magnitude of an AIC value is not interpretable on its own; only differences between AIC values are. I also understand that AIC can be used to compare non-nested models, provided the full likelihoods (normalizing constants included) are used.
BUT, unless I'm losing my mind, there is no way we can compare the MAGNITUDE (yes, the y-value) of the log-likelihood functions evaluated at our MLE coefficient vectors when the responses come from different distributions. Right? After all, given the data, the normal likelihood for an observation is $\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$, so the log-likelihood is quadratic in $\mu$, and hence quadratic in the linear combination of our coefficient estimates. The binomial likelihood is $\binom{n}{y}p^y(1-p)^{n-y}$, a polynomial in $p$; and since $p$ is related to the linear predictor through the (logistic) link, the log-likelihood is a very different function of the predictor coefficients. I'm fairly sure the y-values of these two log-likelihood functions are incomparable in general, let alone at a specific point. Is that untrue?
In fact, even setting that aside: how can you compare models at all when the response has been transformed, e.g. by fitting log(y) or 1/y in place of y?
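For the transformed-response case, one adjustment I've seen suggested is a change-of-variables (Jacobian) correction: the likelihood of the log(y) model is a density for log(y), not y, and adding the log-Jacobian of the transform puts both models on the original y scale. A minimal sketch, on simulated data (the data-generating process is invented for illustration):

```r
## Comparing lm(y ~ x) with lm(log(y) ~ x) via their reported logLik/AIC
## is invalid as-is; the correction below moves the second fit to the
## y scale via the Jacobian of the log transform, d(log y)/dy = 1/y.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))   # lognormal-style response

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

logLik(fit_raw)                          # log-likelihood for y
logLik(fit_log)                          # log-likelihood for log(y)

## Corrected to the y scale: log f_Y(y) = log f_W(log y) - log(y),
## summed over observations.
as.numeric(logLik(fit_log)) - sum(log(y))
```

With this correction (and the matching +2k penalty), the two AICs at least refer to densities of the same random variable, which is the minimal requirement for the comparison to mean anything.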