Example and counterexample for Stone's (1977) assumption

Question

Stone (1977) considers the problem of the choice of predicting density for $y$ given $x$ from a prescribed class of formal predicting densities $\{f(y|x,\alpha,S), \alpha \in \mathscr{A}\}$ whose members are indexed by the choice parameter $\alpha$. He shows that AIC and LOOCV (leave-one-out cross validation) are asymptotically equivalent provided that the following assumption holds:

The conditional distribution of $y$ given $x$ in the distribution $P$ is $f(y|x,\theta^*)$ for some unique $\theta^* \in \Theta$, that is, the conventional model $\{f(y|x,\theta),\theta \in \Theta\}$ is true.

I am having a hard time understanding this formal requirement and using it in applications.

Could anyone illustrate when this assumption holds vs. when it fails by an example and a counterexample?

References

Stone, M. (1977). An asymptotic equivalence of choice of model by cross‐validation and Akaike's criterion. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 44-47.

Related question: [Is this a typo in Stone's (1977) paper on asymptotic equivalence between AIC and LOOCV?](https://stats.stackexchange.com/questions/407286/). — Richard Hardy, May 08 '19 at 14:59
Related question: [Equivalence of AIC and LOOCV under mismatched loss functions](https://stats.stackexchange.com/questions/406430/equivalence-of-aic-and-loocv-under-mismatched-loss-functions). — Richard Hardy, May 08 '19 at 15:36

score 1 · Answer 1 · answered May 10 '21 at 09:59

This simply formalizes the fact that AIC asymptotically picks the correct distribution out of our hat if it was in the hat in the first place.

Put simply, if your data come from a $\text{Pois}(\lambda^\ast)$ distribution and your hat contains all possible Poisson distributions, then you are good. But if your hat only contains Poisson distributions whose parameters differ from $\lambda^\ast$ by at least $\epsilon>0$, then of course you will not get there. (If your hat contains every admissible value except the true one, i.e. $(0,\infty)\setminus\{\lambda^\ast\}$, then I presume that the asymptotic result still holds, because we can get arbitrarily close to $\lambda^\ast$. But that is not a counterexample to the statement.)
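To make the hat picture concrete, here is a minimal sketch (my own illustration, not taken from Stone's paper). It uses the Kullback-Leibler divergence between Poisson distributions, which under standard regularity conditions is the population quantity that maximum likelihood asymptotically minimizes, and compares a hat that contains the true rate with one that has an $\epsilon$-hole around it. The values `lam_true = 3.0` and `eps = 0.5` and the grid of candidate rates are arbitrary assumptions.

```python
import math


def kl_poisson(lam_true: float, lam: float) -> float:
    """Kullback-Leibler divergence KL(Pois(lam_true) || Pois(lam))."""
    return lam_true * math.log(lam_true / lam) - lam_true + lam


def best_in_hat(lam_true: float, hat: list) -> float:
    """Return the candidate rate in `hat` that is closest to the truth in KL."""
    return min(hat, key=lambda lam: kl_poisson(lam_true, lam))


lam_true = 3.0  # hypothetical true rate
eps = 0.5       # hypothetical radius of the hole around the truth

# Candidate rates 0.01, 0.02, ..., 10.00 (a discretized hat)
grid = [k / 100 for k in range(1, 1001)]
full_hat = grid
holed_hat = [lam for lam in grid if abs(lam - lam_true) >= eps]

best_full = best_in_hat(lam_true, full_hat)
best_holed = best_in_hat(lam_true, holed_hat)

print("full hat :", best_full, kl_poisson(lam_true, best_full))
print("holed hat:", best_holed, kl_poisson(lam_true, best_holed))
```

With the full hat the best candidate is the truth itself and the divergence is zero. With the holed hat the best candidate sits on the edge of the hole and the divergence stays strictly positive, no matter how much data arrives.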

Alternatively, perhaps your data come from a negative binomial distribution, but your hat only contains Poissons. (The other way around probably works again, because we can approximate a Poisson arbitrarily well with a negative binomial by shrinking the overdispersion parameter toward zero. Again, not a counterexample.)
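A rough numerical sketch of this point, again my own illustration: it computes the Kullback-Leibler divergence from a negative binomial to a Poisson with the same mean, which is the best Poisson in the KL sense. I assume the mean/size parameterization with variance $\mu + \mu^2/r$, so that a large size `r` corresponds to small overdispersion; the values of `mu`, `r` and the truncation point `y_max` are arbitrary choices.

```python
import math


def nb_logpmf(y: int, r: float, mu: float) -> float:
    """Log-PMF of a negative binomial with size r and mean mu."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def pois_logpmf(y: int, lam: float) -> float:
    """Log-PMF of a Poisson with rate lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def kl_nb_to_pois(r: float, mu: float, lam: float, y_max: int = 500) -> float:
    """KL(NB(r, mu) || Pois(lam)), with the sum truncated at y_max."""
    total = 0.0
    for y in range(y_max + 1):
        lp = nb_logpmf(y, r, mu)
        total += math.exp(lp) * (lp - pois_logpmf(y, lam))
    return total


mu = 3.0  # hypothetical common mean

# Strongly overdispersed truth: no Poisson in the hat matches it.
kl_overdispersed = kl_nb_to_pois(r=2.0, mu=mu, lam=mu)

# Nearly Poisson truth: the negative binomial is almost in the Poisson hat.
kl_nearly_poisson = kl_nb_to_pois(r=2000.0, mu=mu, lam=mu)

print(kl_overdispersed, kl_nearly_poisson)
```

The first divergence is clearly bounded away from zero, so a Poisson-only hat never contains the truth. The second is tiny, which is the sense in which a negative binomial hat can get arbitrarily close to a Poisson truth.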

The other part of the statement is uniqueness: there must be only a single parameter $\theta^\ast$ in the hat that gives the true distribution $f(y|x,\theta^\ast)$, so we can asymptotically converge to it. A counterexample to uniqueness would be a hat consisting of a slice of the real plane, $\{(\theta,\tau)\in\mathbb{R}^2 \mid 0<\theta<\tau\}$, where a parameter vector $(\theta,\tau)$ parameterizes a $\text{Pois}(\tau-\theta)$ distribution. Then of course a single Poisson will be parameterized by many different pairs $(\theta,\tau)$... but they will be indistinguishable, since they all yield the same PMF.
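The non-uniqueness in this toy parameterization is easy to check directly. The sketch below is only an illustration; the two parameter pairs are arbitrary choices with the same difference $\tau-\theta=3$.

```python
import math


def pmf(y: int, theta: float, tau: float) -> float:
    """PMF of Pois(tau - theta), valid for 0 < theta < tau."""
    lam = tau - theta
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


pair_a = (1.0, 4.0)  # hypothetical (theta, tau) with tau - theta = 3
pair_b = (2.0, 5.0)  # a different pair with the same difference

# Largest pointwise difference between the two PMFs over y = 0, ..., 49
max_gap = max(abs(pmf(y, *pair_a) - pmf(y, *pair_b)) for y in range(50))
print(max_gap)
```

Both pairs give exactly the same PMF, so no amount of data can tell them apart: the likelihood is flat along every line $\tau-\theta=\text{const}$.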

I honestly don't see (yet) how this is important. I would assume that non-uniqueness of $\theta^\ast$ simply means that we would approach the solution space arbitrarily closely but might oscillate wildly within it. This would, again, make no difference to what we can actually observe; it would just make the mathematics more opaque.
