
Under certain conditions, AIC is an efficient model selection criterion. I understand this roughly to mean that AIC tends to select the model that yields the largest expected likelihood of a new data point from the same data-generating process or population (among all the models we are selecting from). This makes AIC the preferred choice if the goal is prediction and predictions are evaluated by the likelihood.
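
For concreteness, by AIC I mean the usual criterion (with $k$ the number of estimated parameters and $\hat{L}$ the maximized likelihood),
$$
\mathrm{AIC} = 2k - 2\ln\hat{L},
$$
and by efficiency I mean, roughly, that the AIC-selected model asymptotically attains the smallest expected out-of-sample negative log-likelihood (equivalently, the smallest Kullback-Leibler divergence from the data-generating process) among the candidate models.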

However, we do not always evaluate prediction accuracy by the likelihood. There are other means of evaluating predictions such as, say, mean squared error (MSE) or mean absolute error (MAE). Questions:

  1. Is AIC still the model selection method of choice if prediction accuracy is evaluated by these loss functions (MSE, MAE)?
  2. What could be a good counterexample, preferably among the well-known loss functions? I.e. what loss function would not favor AIC as the model selection criterion?
  3. How can we characterize the entirety of loss functions for evaluating prediction accuracy that are compatible with AIC being the method of choice for model selection?
Richard Hardy
  • For Q2. Is hinge loss considered "fair game"? Essentially any discontinuous loss function is a reasonable candidate for a counterexample regarding a metric that is strongly based on a log-likelihood. – usεr11852 Sep 10 '19 at 18:45
  • @usεr11852, I do not have a strong opinion on this one. If you describe it in more detail and explain the intuition, it could make a good answer. – Richard Hardy Sep 10 '19 at 19:05
  • A somewhat related question: [Equivalence of AIC and LOOCV under mismatched loss functions](https://stats.stackexchange.com/questions/406430/). – Richard Hardy Sep 10 '19 at 19:07
  • @CagdasOzgenc, is that an answer to question 3? – Richard Hardy Sep 27 '21 at 11:57

2 Answers


I think the answer to 1) should be "no", as there is no reason in general to expect that the model which maximizes the expected likelihood will also minimize the MSE, MAE, etc. One might even think of a case in which the likelihood is well defined but the MSE diverges as the sample size increases (e.g. a distribution with no finite moments, such as the Cauchy).
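
A quick simulation sketch of the Cauchy point (my own illustration, assuming we evaluate the constant point prediction $0$ under squared error):

```python
# Minimal sketch: with i.i.d. standard Cauchy data, the running mean squared error
# of the fixed prediction 0 does not converge as the sample grows, because the
# Cauchy distribution has no finite second moment, even though its likelihood is
# perfectly well defined.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_cauchy(10_000_000)

for n in (10**3, 10**5, 10**7):
    running_mse = np.mean(x[:n] ** 2)  # empirical MSE of predicting 0 for the first n points
    print(f"n = {n:>10,}: running MSE = {running_mse:,.1f}")
# The printed values keep jumping around and tend to grow instead of settling down.
```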

I would think that AIC will still be a good method for model selection for any loss function that is a monotone function of the expected likelihood, but that is probably a trivial remark that will not help you much.

F. Tusell

I have to disagree with F. Tusell's answer, which I believe reflects a confusion about what the AIC evaluates versus what loss functions like the MSE evaluate.

The AIC evaluates a "modeling" density. (I use quotes around "modeling" to distinguish it from a predictive density, where we would use proper scoring rules for evaluation.) Loss functions like the MAE, the MSE and quantile losses evaluate single-number summaries (Kolassa, 2020, IJF) of such modeling (or predictive) densities.

Now, by Stone (1977), the AIC will asymptotically be minimized by the true conditional density (provided it is in the candidate pool; more on this below). Once we have the true conditional density, we can extract the functional from it that minimizes the loss function (the conditional expectation for the MSE, the median for the MAE, the quantile for the quantile loss). Thus, the procedure of "pick the density that minimizes AIC, then extract the appropriate functional for our loss" will asymptotically yield the lowest loss.
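
For concreteness, here is a minimal sketch of that two-step procedure (my own toy illustration, not code from the cited references; the two-member candidate pool and the data are made up):

```python
# Step 1: pick the candidate density with the smallest AIC.
# Step 2: read off the functional of that density that is optimal for the chosen loss.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(3, 2, 5_000)  # observed data (toy example)

# A small candidate pool of fitted parametric densities (each has 2 fitted parameters).
candidates = {
    "normal": (stats.norm, stats.norm.fit(y)),
    "laplace": (stats.laplace, stats.laplace.fit(y)),
}

def aic(dist, params, y):
    return 2 * len(params) - 2 * np.sum(dist.logpdf(y, *params))

name, (dist, params) = min(candidates.items(),
                           key=lambda kv: aic(kv[1][0], kv[1][1], y))

# Extract the loss-appropriate functional from the AIC-chosen density.
point_forecasts = {
    "MSE  -> mean": dist.mean(*params),
    "MAE  -> median": dist.median(*params),
    "quantile loss (0.9) -> 90% quantile": dist.ppf(0.9, *params),
}
print(name, point_forecasts)
```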

Now, all this of course relies on a number of assumptions.

  • As F. Tusell writes, if the conditional density does not have an expectation, then the "extract the minimum MSE functional" part will not work, so the entire pipeline breaks down. (But if the true DGP follows a Cauchy distribution, what would be the optimal point prediction under the MSE, anyway?)

  • If the true conditional density is not in our candidate pool of possible models, the asymptotic results of Stone (1977) do not hold. AIC will still asymptotically find the model in the pool with minimal Kullback-Leibler divergence from the true DGP, so it would still be a good start - although that model might have a worse expected loss than some other model in the pool.

    As an example, our data may be $N(0,2)$ distributed (variance 2), but our model pool may only contain the $N(\mu,1)$ distributions, so the key assumption of Stone (1977) is not satisfied. Assume we are interested in a 90% quantile prediction. The AIC will be optimized in our pool by $N(0,1)$, which reports its quantile of $q_{90\%}(0,1)\approx 1.28$. A distribution-free approach that simply optimizes the quantile loss will yield the correct $1.81$. And of course, there is a model in our pool that would yield a lower loss, namely $N(0.53,1)$ with $q_{90\%}(0.53,1)\approx 1.81$, but the AIC won't like that one. (A small numerical sketch of this example follows after this list.)

  • Finally, of course, if the functional form differs between fitting and prediction, all bets are off. If you get hit by a worldwide pandemic, your AIC-optimal model based on 2019 data will not be very useful for a (point) forecast of toilet paper demand in Germany in early 2020.

  • And finally-finally, asymptotics may be a long way off - far enough for the DGP to indeed change, as per the previous bullet point.
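
A small numerical check of the example above (my own sketch; it assumes $N(0,2)$ means variance 2 and uses the standard pinball loss at level $\tau=0.9$ as the quantile loss):

```python
# Candidate pool: all N(mu, 1). True DGP: N(0, variance 2), i.e. sd = sqrt(2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tau = 0.9
sd_true = np.sqrt(2)
train = rng.normal(0, sd_true, 100_000)
test = rng.normal(0, sd_true, 1_000_000)

def pinball(y, q, tau):
    # standard quantile (pinball) loss of prediction q, averaged over observations y
    return np.mean(np.maximum(tau * (y - q), (1 - tau) * (q - y)))

def aic_unit_var_normal(mu, y):
    # AIC of the candidate N(mu, 1); one fitted parameter (mu)
    return 2 * 1 - 2 * np.sum(stats.norm.logpdf(y, loc=mu, scale=1))

# AIC over the pool is optimized at mu = mean(train) ~ 0; its 90% quantile is ~1.28.
mu_aic = train.mean()
q_aic = stats.norm.ppf(tau, loc=mu_aic, scale=1)

# The in-pool model whose 90% quantile matches the true one (~1.81) needs mu ~ 0.53.
mu_alt = stats.norm.ppf(tau, loc=0, scale=sd_true) - stats.norm.ppf(tau)
q_alt = stats.norm.ppf(tau, loc=mu_alt, scale=1)

for label, mu, q in [("AIC pick N(0,1)     ", mu_aic, q_aic),
                     ("alternative N(.53,1)", mu_alt, q_alt)]:
    print(f"{label}: AIC = {aic_unit_var_normal(mu, train):,.0f}, "
          f"out-of-sample quantile loss = {pinball(test, q, tau):.4f}")
# The AIC-preferred N(0,1) has the smaller AIC but the larger quantile loss;
# the AIC-disfavoured N(0.53,1) gives the better 90% quantile prediction.
```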

Stephan Kolassa
  • Thank you! I am very glad you have addressed so many of my questions today! Could you explain/rephrase *if the functional form differs between fitting and prediction*? Also, I am not sure about *Thus, the procedure of "pick the density that minimizes AIC, then extract the appropriate functional for our loss" will asymptotically yield the lowest loss* as this might be at odds with the idea of FIC (focused information criterion). A (perhaps remote) parallel could be made with Efron's "Maximum likelihood and decision theory" (1982) and the fact that MLE is often inadmissible in higher dimensions. – Richard Hardy May 10 '21 at 10:50
  • A tentative thought: the devil might be in your 3rd paragraph. It appears logical on its face but brings about a gut feeling that things might be a bit more complicated than that. I will need more time to think this through. Another thing: *If the true conditional density is not in our candidate pool of possible models, the asymptotics results of Stone (1977) do not hold (but AIC will still asymptotically find the model...)*. Would LOOCV not yield the same model as AIC then, at least if the loss function used in LOOCV equals / is "compatible" with the negative log-likelihood? – Richard Hardy May 10 '21 at 10:58
  • About the functional form differing, that is just a fancy way of describing a possible structural break in the relationship between the predictors and the outcome (where "predictors" could include the intercept: a level shift). For instance, [you might get hit by a worldwide pandemic](https://stats.stackexchange.com/q/514358/1352). An AIC-optimal model based on 2019 data would not have helped you predict the demand for toilet paper in Germany at the beginning of 2020. – Stephan Kolassa May 10 '21 at 11:06
  • I'm looking forward to reading up on FIC and figuring out how this ties together - it may simply be a case of faster convergence using the FIC. I agree that my third paragraph makes life look simple. I believe that this is because life in this case *is* simple. I'm looking forward to reading your counterarguments. Regarding LOOCV, [I still have that question on my to-do list](https://stats.stackexchange.com/q/406430/1352) and hope to get around to it some time soon. – Stephan Kolassa May 10 '21 at 11:09
  • I added a few clarifications and illustrations. I'm looking forward to seeing your thoughts as to where my argument is wrong. – Stephan Kolassa May 10 '21 at 11:54
  • Thank you so much! I like the example. I do have thoughts (based on FIC, inadmissibility of MLE and other things already mentioned above and in our earlier discussions) but I need some time. By the way, there is an unfinished sentence at the end. – Richard Hardy May 10 '21 at 12:17
  • @RichardHardy, how are your misgivings coming along? – Stephan Kolassa May 20 '21 at 13:59
  • Slowly (unfortunately). I am not sure whether it is easier to prove why what I think is wrong is so or to prove what ought to be right. Well, we have a proof of optimality of FIC which seems to contradict your statement, so I take it as a tentative proof that something *is* wrong with the latter. What about your last unfinished sentence? And what exactly do you mean by *the asymptotics results of Stone (1977) do not hold*? (By the way, you got my +1 already, there was no hesitation there; I do appreciate your help.) – Richard Hardy May 20 '21 at 14:38
  • Sorry, that last unfinished sentence was an atrophied leftover of an edit I then worked into one of the bullet points, where it made more sense. Also, no problem about any +1 or not, I'm not *exclusively* here for the Magic Internet Points. – Stephan Kolassa May 20 '21 at 15:22
  • Re the asymptotics not holding - that refers to the fact that Stone (1977) presupposes that the true model is in the pool of considered models: $\theta\in\Theta$; [that other question we have been discussing](https://stats.stackexchange.com/q/407291/1352). If we *can't* converge to the true model, because it's outside our box, then of course we *won't* converge to it. – Stephan Kolassa May 20 '21 at 15:27
  • Regarding FIC, I just downloaded all the JASA papers (it turns out I have had the book since 2010; I should really *read* all my stuff some day). Anyway, it may be that the FIC and AIC-pick-functional are not in conflict at all. I can't promise that I'll be able to disentangle all this, but I do find it fascinating. – Stephan Kolassa May 20 '21 at 15:28
  • Didn't Stone consider convergence between model selection outcomes of AIC and LOOCV? If so, it is not obvious that the true model being outside the pool of candidates should lead to AIC and LOOCV selection not converging. – Richard Hardy May 20 '21 at 16:46
  • I have just had a quick read of the previous answer and comments, which I cannot contribute to clarify (rather than having misconceptions, I am afraid I lack conceptions!). Just a comment in passing: mention is made of "...AIC will still asymptotically find the model...". I seem to recall that AIC is inconsistent in the sense that no matter how large a sample you take, there is a non-vanishing probability that it will choose a model that is too large. In the light of this, the argument of Mr. Kolassa might need some rephrasing. – F. Tusell May 23 '21 at 08:57
  • @F.Tusell, the argument probably has to do with efficiency (rather than consistency) of AIC. – Richard Hardy Oct 13 '21 at 17:18