
Under certain conditions, AIC and LOOCV (leave-one-out cross validation) are asymptotically equivalent (Stone, 1977). Stone's paper is less than 4 pages long, but quite mathy, so I turn here for some assistance. I presume the equivalence holds when the loss function in LOOCV is exactly the same as, or somehow compatible with, the loss function implied by the likelihood as used in the AIC.
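For intuition, here is a small numerical sketch of the equivalence (my own illustration, not Stone's argument; all function names are made up): with a Gaussian likelihood, AIC/2 and the LOOCV sum of negative log predictive likelihoods should roughly agree for moderately large $n$, and the two criteria should tend to rank models the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2.0, 2.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)  # true model: degree 1

def fit_gaussian_poly(x_tr, y_tr, degree):
    """OLS polynomial fit; returns coefficients and the MLE of sigma^2."""
    X = np.vander(x_tr, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)
    sigma2 = np.mean((y_tr - X @ beta) ** 2)  # divide by n: the MLE
    return beta, sigma2

def gauss_loglik(y_obs, mu, sigma2):
    return -0.5 * np.log(2 * np.pi * sigma2) - (y_obs - mu) ** 2 / (2 * sigma2)

def aic(degree):
    beta, sigma2 = fit_gaussian_poly(x, y, degree)
    ll = gauss_loglik(y, np.vander(x, degree + 1) @ beta, sigma2).sum()
    k = degree + 2  # degree+1 coefficients plus sigma^2
    return 2 * k - 2 * ll

def loocv_nll(degree):
    """LOOCV with negative log-likelihood as the loss, as in Stone (1977)."""
    idx = np.arange(n)
    total = 0.0
    for i in idx:
        beta, sigma2 = fit_gaussian_poly(x[idx != i], y[idx != i], degree)
        mu_i = (np.vander(x[[i]], degree + 1) @ beta)[0]
        total -= gauss_loglik(y[i], mu_i, sigma2)
    return total

degrees = [1, 2, 3, 5]
aics = [aic(d) for d in degrees]
cvs = [loocv_nll(d) for d in degrees]
for d, a, c in zip(degrees, aics, cvs):
    print(f"degree {d}: AIC/2 = {a / 2:.1f}, LOOCV NLL = {c:.1f}")
```

In runs like this, AIC/2 and the LOOCV negative log-likelihood land within a few units of each other, which is the finite-sample face of Stone's asymptotic result.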

Questions

  1. What happens if the loss function employed in LOOCV does not exactly correspond to the loss function implied by the likelihood in AIC?
    For example, say that the loss function in LOOCV is some form of tick function (quantile loss) while the likelihood is normal?
  2. Under what conditions, at least roughly, would the asymptotic equivalence hold?
    Simple, concrete examples as well as rigorous explications are welcome.
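To make question 1 concrete, one way to set up the mismatched case numerically is the following sketch (my own illustration, not an answer; names are made up): fit by Gaussian maximum likelihood, form the plug-in $\tau$-quantile forecast $\hat\mu + z_\tau \hat\sigma$, score it by LOOCV under the tick loss, and check whether the tick-loss minimizer coincides with AIC's choice.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 150
x = rng.uniform(-2.0, 2.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)  # Gaussian truth, degree 1

def ols(x_tr, y_tr, degree):
    """OLS polynomial fit; returns coefficients and the MLE of sigma^2."""
    X = np.vander(x_tr, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)
    return beta, np.mean((y_tr - X @ beta) ** 2)

def aic(degree):
    _, sigma2 = ols(x, y, degree)
    ll = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian profile log-lik
    return 2 * (degree + 2) - 2 * ll

def pinball(u, tau):
    """Tick (quantile) loss on the forecast error u = y - q_hat."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def loocv_pinball(degree, tau=0.9):
    """LOOCV tick loss of the Gaussian plug-in tau-quantile forecast."""
    z = NormalDist().inv_cdf(tau)
    idx = np.arange(n)
    losses = []
    for i in idx:
        beta, sigma2 = ols(x[idx != i], y[idx != i], degree)
        q_hat = (np.vander(x[[i]], degree + 1) @ beta)[0] + z * np.sqrt(sigma2)
        losses.append(pinball(y[i] - q_hat, tau))
    return float(np.mean(losses))

degrees = [1, 2, 3, 5]
aic_pick = degrees[int(np.argmin([aic(d) for d in degrees]))]
cv_pick = degrees[int(np.argmin([loocv_pinball(d) for d in degrees]))]
print(f"AIC picks degree {aic_pick}, tick-loss LOOCV picks degree {cv_pick}")
```

Since the tick loss scores only a point forecast while the log-likelihood scores the whole density, there is no general reason the two selections must agree, which is exactly what the question probes.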

References

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. *Journal of the Royal Statistical Society, Series B*, 39(1), 44–47.

Richard Hardy
  • Related question: [Is this a typo in Stone's (1977) paper on asymptotic equivalence between AIC and LOOCV?](https://stats.stackexchange.com/questions/407286/). – Richard Hardy May 08 '19 at 14:40
  • Related question: [Example and counterexample for Stone's (1977) assumption](https://stats.stackexchange.com/questions/407291/). – Richard Hardy May 08 '19 at 15:00
  • Somewhat related question: [Optimality of AIC w.r.t. loss functions used for evaluation](https://stats.stackexchange.com/questions/425675/). – Richard Hardy Sep 10 '19 at 19:08
  • Hm. Aren't we comparing apples to oranges here? Information criteria assess entire likelihoods, and quantile losses assess point forecasts. So the idea would be to obtain a predictive density, then extract the quantile from that, and to assess this using the quantile loss. I see no conflict. An (IMO) more interesting question would be to compare AIC to cross-validation using density predictions assessed through proper scoring rules. What do you think? – Stephan Kolassa May 10 '21 at 08:31
  • @StephanKolassa, on the one hand, probably so. On the other hand, negative log-likelihood (the basis of AIC) is a loss function. Quantile loss is also a loss function. In that sense, both look like apples to me. Moreover, information criteria are typically employed for selecting models that will be used to produce points forecasts. In such cases one seeks a model that has good properties w.r.t. the loss from the point forecasts. Regarding the more interesting question, I do not disagree (I am even writing a paper on density forecasting right now). But this thread by itself is not about that. – Richard Hardy May 10 '21 at 08:40
  • @StephanKolassa, more generally, density forecasting is good when you do not have the user's loss function. When you do, point forecasting is sufficient. Since it is usually easier, I would avoid first producing a density forecast and then converting it to a point forecast when possible, because this appears generally inefficient (as I mentioned in one of our earlier discussions). This has to do with the focus parameter and the difference between AIC and FIC (focused information criterion). If we know what the focus is, AIC may mislead us, while FIC will be spot on (in theory, of course). – Richard Hardy May 10 '21 at 08:43
  • I see your point, even if I respectfully disagree. I see too many questions on CV about the MAPE and similar to believe that users are anything else than deeply confused about their loss functions, so I have been shrilly arguing for density forecasts that people can extract their favorite functionals from (like [here](https://stats.stackexchange.com/a/494032/1352) and [here](https://doi.org/10.1016/j.ijforecast.2019.02.017)). But I understand your question better now, and it's a good one, although I wouldn't call the loss functions "mismatched" as in your question title. – Stephan Kolassa May 10 '21 at 08:56
  • In any case, while I'm digging through Stone (1977), do you have any recommended references on the FIC, which I don't think I have encountered before? – Stephan Kolassa May 10 '21 at 08:58
  • @StephanKolassa, thank you for a thoughtful discussion. I can easily picture that users are confused about their loss functions! Regarding FIC, the corresponding [Wikipedia article](https://en.wikipedia.org/wiki/Focused_information_criterion) contains the main references. Besides that, you may follow the work of Gerda Claeskens; she and her students have done quite a bit more on FIC since 2003. – Richard Hardy May 10 '21 at 09:38
  • I believe the first question we need to address is: LOOCV *of what*? "LOOCV" is not a complete description. You can fit models and LOOCV the MSE, the MAE, a quantile loss, the (log) likelihood, a proper scoring rule, or anything else. Stone (1977) discusses LOOCV log likelihood. Thus, since the log score is a proper scoring rule, we will asymptotically obtain the correct LOOCV predictive density. If our data are IID conditional on predictors, this seems to imply that AIC will asymptotically get us the true future density. ... – Stephan Kolassa May 20 '21 at 13:57
  • ... I think we are close enough to [my answer](https://stats.stackexchange.com/a/523522/1352) to [your related question](https://stats.stackexchange.com/q/425675/1352) that I am sorely tempted to vote-to-close as a duplicate. What do you think? – Stephan Kolassa May 20 '21 at 13:58
  • @StephanKolassa, In my view, the two questions are clearly distinct. Your answer to one of them may be broad enough to (partly) cover both questions, but I would not consider this a good reason for merging the questions. Regarding LOOCV of what: LOOCV based on the loss function used for evaluation (e.g. square loss, quantile loss, ...). It does not have to equal to the negative (log-)likelihood. – Richard Hardy May 20 '21 at 14:30
  • Hm. Then I'm not quite sure what you are asking here. Of course, AIC will not asymptotically estimate LOOCV square loss (because it asymptotically estimates LOOCV log-likelihood, and that's different). Are you asking about using a pipeline "use AIC to pick a model, then calculate the expected squared loss against that model's conditional expectation", and whether this pipeline will converge to the LOOCV squared loss? If so, I'd say that's exactly that other question. If not, what exactly is your question here? – Stephan Kolassa May 20 '21 at 15:20
  • @StephanKolassa, I will come back to this, but at the moment I am bothered by a couple of deadlines at work... – Richard Hardy May 20 '21 at 16:48
  • I have looked at the two questions again and cannot see how they could be considered duplicates. This question asks about equivalence between A and B while the other asks about optimality of A. (Here, A ~ AIC, B ~ LOOCV.) Clearly, these are distinct questions. Now, in the current question I am interested in model selection, not in estimating the expected loss of the selected model. I am interested in under what conditions AIC and LOOCV will select the same model. Is the match between the loss function and the negative log-likelihood strictly necessary for selecting the same model? – Richard Hardy May 20 '21 at 18:22

0 Answers