How frequently are the information conditions for proper Akaike information criterion application actually met?

Question

Question: what is assumed for AIC, when is this correct, and how do we know it is ever correct? How do we verify the assumptions?

The Akaike information criterion (AIC) is limited to goodness-of-fit for assumed distributions. That is, one can assume any maximum likelihood function, but almost all of the time AIC is used with an assumed normal distribution of residuals. It is unclear to me how heteroscedastic residuals from nested models of a larger model with uncertain parameters relate to likelihood functions. It is further unclear how nested model solutions for larger problems with some uncertain parameters, which can nevertheless be solved using, for example, Tikhonov regularization, relate to the likelihood of anything when the regularization is applied adaptively to stabilize an ill-posed integral (Tk-GV). Theil regression, Deming regression and myriad other regression techniques are superior to ordinary least squares for finding the relations between co-varying functions, and for those techniques goodness-of-fit is largely irrelevant, counterproductive, or mutated enough to be out of context. My question then is, "What are the exact information criteria that would permit AIC to be used properly, and how reasonable is it to assume that they are physically relevant for inverse problem solutions with any appreciable frequency?"

Simple example: Suppose that I own shares in a company. Does goodness of fit or a likelihood analysis tell me what those shares will be worth tomorrow? Doubtful, isn't it? Suppose I persist and try to make a prediction; do I not have to match the derivatives of the trending data as well as the data itself? In fact, I do not then care if the model is parsimonious today, I only care about parsimony for tomorrow. Does AIC apply to selecting a model for tomorrow's share price? Would not extrapolation testing be a better approach to model selection, as in "see what works"? To see what works, AIC would not tell us how errors are organized in the way that examining a residual plot does. One can test extrapolation by withholding data, but the AIC from the fit region does not contain the withheld information, and thus is not useful for testing prediction. Ordinary least squares (OLS) is not useful for extrapolation either, as the residuals are systematically biased. This makes Theil median regression more useful for extrapolation than OLS, and Wilcoxon signed rank sum testing more useful for measuring the accuracy of extrapolation, whereas no accuracy information is available from AIC.
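To make the idea of extrapolation testing by withholding data concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the toy series, the single outlier placed in the fit region, and the helper names `ols_line`, `theil_line` and `signed_rank_sums`. It fits both lines to the early points only, predicts the withheld later points, and summarizes the signed prediction errors by their Wilcoxon signed-rank sums. It shows the mechanics of the test, not a general result about either estimator.

```python
from statistics import median

def ols_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_line(x, y):
    """Theil median slope (median of pairwise slopes) and median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept

def signed_rank_sums(errors):
    """Wilcoxon signed-rank sums (W+, W-); assumes no zero or tied |errors|."""
    ranked = sorted(errors, key=abs)
    w_plus = sum(rank for rank, e in enumerate(ranked, start=1) if e > 0)
    w_minus = sum(rank for rank, e in enumerate(ranked, start=1) if e < 0)
    return w_plus, w_minus

# Toy series (assumed): true trend y = 2x + 1, one outlier in the fit region.
x_fit = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_fit = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 25.0]
# Withheld "future" points, never seen by either fit.
x_new = [8.0, 9.0, 10.0, 11.0]
y_new = [17.2, 18.7, 21.1, 22.6]

results = {}
for name, fit in (("OLS", ols_line), ("Theil", theil_line)):
    slope, intercept = fit(x_fit, y_fit)
    # Signed extrapolation errors: prediction minus withheld observation.
    errors = [intercept + slope * xi - yi for xi, yi in zip(x_new, y_new)]
    results[name] = (errors, signed_rank_sums(errors))
    print(name, [round(e, 2) for e in errors], results[name][1])
```

A rank sum that falls entirely on one side means every withheld point was missed in the same direction, which is the kind of systematic extrapolation bias that an AIC value computed on the fit region cannot reveal. For a p-value, a library routine such as `scipy.stats.wilcoxon` could be used instead of the raw rank sums.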

AIC is certainly not limited to ordinary least squares (OLS) models. It is derived in a maximum likelihood setting, which includes many models not estimable with OLS. Regarding regularization, AIC is available for LASSO fits (perhaps also Tikhonov regularization, I am not sure), which suggests it could be used quite broadly; the key is estimating the effective number of degrees of freedom of a model in a sensible way. — Richard Hardy, May 20 '16 at 18:32
It is not designed to be used with the same model and two different types of regression analysis. Moreover, if you have any idea of how to apply it to adaptive Tikhonov regularization for minimization of a combination of parameters that is unrelated to goodness of fit evaluated at the x-axis points, but is minimizing the error of the integral of x from 0 to infinity, then do tell. And, that is the proper use for Tikhonov regularization, to solve for ill-posed integrals, not to fit a curve. — Carl, May 21 '16 at 00:59
You have a lot of questions in one post (and the title does not quite reflect them, IMHO). It would be easier for the audience if you focused on one concrete question. Also, could you specify what are *inverse problems*, *propagated errors*, etc., if you need to use those terms? — Richard Hardy, May 21 '16 at 16:06
[Prerequisites for AIC model comparison](http://stats.stackexchange.com/questions/48714/prerequisites-for-aic-model-comparison) could be a relevant thread. Also, are you aware of the fact that AIC is asymptotically equivalent to cross validation (which is what you refer to as *extrapolation testing*, if I understand you correctly)? — Richard Hardy, May 21 '16 at 16:14
Cross validation is an interpolation technique to test transplanted goodness-of-fit within a range of random unsorted data. Extrapolation is sorted data where the first data set has no overlap within the range of the x-coordinates of the second, predicted data set. For extrapolation, one has to have some understanding of the mechanics of the system and what functions are demonstrably basis function related. For AIC, the likelihood assumptions are not usually tested. Without testing, the validity is questionable. — Carl, May 21 '16 at 17:34
Moreover, the link I gave suggests that the likelihood assumption may be untestable. [link](http://philpapers.org/go.pl?id=FORCTA&proxyId=none&u=http%3A%2F%2Fphilosophy.wisc.edu%2Fforster%2FLTE.pdf). To define all the needed terms, I may have to post a trial answer, and links to a lot of sources. Is that fair game here? BTW, thanks for paying attention to this, AIC comes up as a reviewer request frequently in the papers I submit, and that usually occurs when I am comparing different regression models. It is not possible to tell a reviewer that the request is nonsense. — Carl, May 21 '16 at 18:10

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

There is a good, if partial, explanation of what AIC is here. From that, we can see that the residuals are so frequently assumed to be Gaussian that one often forgets that this is not necessarily the case. When they are Gaussian, the only variable term in the log-likelihood being optimized, that is, the loss (or fit) term, is $-\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu)$.
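As a minimal sketch of what the Gaussian assumption does in practice, the following plain-Python example computes AIC for two least-squares fits under the assumption of independent Gaussian residuals with a common variance (that is, $K = \sigma^2 I$). The function names and the toy data are illustrative assumptions, not taken from any of the linked sources.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def gaussian_aic(residuals, n_mean_params):
    """AIC = 2k - 2 ln L, assuming i.i.d. Gaussian residuals.

    The variance is set to its maximum-likelihood estimate RSS/n and is
    counted as one extra estimated parameter.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n
    # At the MLE the quadratic term -RSS / (2 sigma^2) reduces to -n/2.
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = n_mean_params + 1
    return 2 * k - 2 * log_lik

# Toy data (assumed), roughly y = 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

slope, intercept = fit_line(x, y)
res_line = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mean_y = sum(y) / len(y)
res_const = [yi - mean_y for yi in y]

aic_line = gaussian_aic(res_line, n_mean_params=2)    # slope + intercept
aic_const = gaussian_aic(res_const, n_mean_params=1)  # mean only
print(aic_line, aic_const)
```

Both values are only meaningful relative to each other, and only to the extent that the Gaussian residual model is believable for the data at hand.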

This results in several problems: 1) Gaussian loss is optimistic, and not a given. 2) In general, loss (or reward) functions can have local minima or be indeterminate. 3) They can be totally inappropriate to the physics of "loss" of the problems they model. It is more appropriate to model the loss function explicitly rather than just assume a Gaussian. For example, counting of nuclear decay statistics is Poisson, not Gaussian.
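For count data such as the decay example, the likelihood can be written down explicitly instead of defaulting to a Gaussian. The sketch below is a hedged illustration only: the counts are made up, and the two candidate models (one constant rate versus separate rates for the first and second halves of the record) are assumptions chosen because their maximum-likelihood estimates are simple sample means.

```python
import math

def poisson_log_lik(counts, rates):
    """Sum of log Poisson probabilities, ln(lam**k * exp(-lam) / k!)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1)
               for k, lam in zip(counts, rates))

def aic(log_lik, n_params):
    """AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

counts = [52, 47, 39, 33, 30, 22, 19, 15]  # made-up counts per interval
n = len(counts)

# Model 1: a single constant rate; the MLE is the sample mean.
rate = sum(counts) / n
aic_one = aic(poisson_log_lik(counts, [rate] * n), n_params=1)

# Model 2: one rate for each half of the record (two parameters).
half = n // 2
r1 = sum(counts[:half]) / half
r2 = sum(counts[half:]) / (n - half)
rates_two = [r1] * half + [r2] * (n - half)
aic_two = aic(poisson_log_lik(counts, rates_two), n_params=2)

print(aic_one, aic_two)
```

The point of the sketch is that the likelihood is chosen to match the counting process, so the AIC comparison is made under an assumption that is at least physically motivated.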

With respect to 1), the Gaussian assumption, there have been attempts to generalize AIC for mixture distributions. There has also been an attempt to generalize AIC to be distribution free for generalized estimating equations (GEE). So, if you want to use some other distribution, go ahead and do so.

One can go ahead and use AIC, AICc and BIC while making specific Gaussian assumptions concerning likelihood, but those assumptions can sometimes be irrelevant to the physical problems under consideration. For example, let us consider AIC in the context of Tikhonov regularization. There are many criteria that can be applied to selecting smoothing factors for Tikhonov regularization. To use AIC in that context, there is a paper that makes rather specific assumptions as to how to perform that regularization, Information complexity-based regularization parameter selection for solution of ill conditioned inverse problems. Specifically, this assumes

"In a statistical framework, ...choosing the value of the regularization parameter α, and by using the maximum penalized likelihood (MPL) method....If we consider uncorrelated Gaussian noise with variance $\sigma ^2$ and use the penalty $p(x) =$ some norm, the MPL solution is the same as the Tikhonov (1963) regularized solution."
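As a worked version of the equivalence stated in that quotation, here is a short sketch in notation that is assumed for illustration rather than taken from the paper: $A$ is the forward operator, $b$ the data, and $\lambda$ a rescaled regularization parameter. With $b = Ax + \varepsilon$ and $\varepsilon \sim N(0, \sigma^2 I)$, the penalized log-likelihood is

$$\ell_p(x) = -\frac{1}{2\sigma^2}\,\|Ax - b\|_2^2 - \alpha\, p(x) + \text{const}.$$

Taking the penalty to be $p(x) = \|x\|_2^2$ (one choice of "some norm"), maximizing $\ell_p$ is the same as

$$\hat{x} = \arg\min_x \left\{ \|Ax - b\|_2^2 + \lambda \|x\|_2^2 \right\}, \qquad \lambda = 2\sigma^2\alpha,$$

whose solution is $\hat{x} = (A^T A + \lambda I)^{-1} A^T b$, the standard Tikhonov form. The equivalence rests entirely on the Gaussian noise model; with a different noise distribution, the data-misfit term would no longer be a sum of squares.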

The question then becomes, why should anyone make those assumptions? For the problems I deal with, the residuals are too poorly behaved to make a Gaussian assumption. Even if one were to generalize the residuals, there is no known solution, so a penalty function cannot be specified. Nor will a goodness of fit criterion allow for specifying the parameters that one wishes to extract from the modelling process. This is especially so because goodness of fit and AIC pertain to the data in a leave-one-out (LOO) sense, but not to extrapolation to infinite time from within an incomplete range of data, as would occur in a time series for which we do not have infinite observation times.

An inverse solution which optimizes AUC to infinite time can be found using Tikhonov regularization to minimize the relative error of whichever parameter one desires to measure for a proportional error measurement system. See Tikhonov adaptively regularized gamma variate fitting to assess plasma clearance of inert renal markers. Inverse problems are not generally curve fitting, but can be used for minimization of propagated error, although admittedly they rarely are, because people often try to minimize arbitrary criteria without thinking through what they are trying to do. Thus, a proper inverse problem solution, one that has a stated goal related to the physical quantity being optimized, may allow identification of what likelihood assumptions should have been made to begin with. In general one is better off not making that type of assumption a priori, but deriving it a posteriori. For example, read this.

So, yes, one can "use" AIC with Tikhonov regularization, but even if one does that correctly, it will not quantify a time series properly.

Here is a link that features some regression types, about a dozen or so out of many, many others. AIC is mentioned only once, less frequently than BIC, with the comment "They [AIC and BIC] also tend to break when the problem is badly conditioned (more features than samples)." To use AIC we are still relegated to using MLR, and that is a severe limitation. One set of exact conditions for OLS that is unrelated to maximum likelihood includes having no x-axis data uncertainty, which applies only in special circumstances such as when exact equal-interval time series are plotted, as well as homoscedasticity. These requirements are often totally ignored, resulting in biased regression. That is, the intersection of OLS and MLR is small and lies in the linear case.

More time should be spent on the conditions that pertain to the physics of the problem being considered than on blindly applying procedures. The need to balance physical units or to perform goal-oriented regression is often a sore point of contention with statistical procedures. Even those who study AIC intensely will admit that sometimes we stand to gain less useful information from it than we would from ANOVA with partial probabilities of parameters.

There is one paper that suggests that AIC relates to nested models, but there appears to be no agreement with that opinion, e.g., see Hyndman point #4.

Another dimension to this problem is that AIC really is not reliable for model selection, even in the restricted conditions in which it is touted to be useful. The only criteria that are generally portable between models of anything are the Pearson Chi-squared probabilities and, for density models, Cramér-von Mises probabilities. It is a Pyrrhic victory to have a better AIC index for a fit that has a p < 0.0001 of being correct. It would appear, for example, that in the earlier ranking of models for finding distributions in Mathematica 10.3, a combined score from maximum likelihood was used. This now appears to have been changed to using BIC in their recent v11 release.

Finally, to use AIC we should:

1) Determine if MLR is applicable to the problem type; for example, for extrapolation, MLR is not the best regression to use.

2) Assume normally distributed (ND) residuals, then test the MLR residuals for ND. If yes, continue; if no, change the MLR residual model and retest.

3) Calculate AIC when MLR residuals agree with MLR residual assumption.

However, other procedures should still be explored, especially since a better AIC value does not tell us if a model is useful.
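The three steps above can be sketched as a small gate that only reports a Gaussian AIC when the residuals do not contradict the Gaussian assumption. This is an illustration under stated assumptions: the Jarque-Bera statistic is swapped in as the normality check (no specific test is named above, and its chi-squared approximation is only asymptotic), and the helper names, the 0.05 threshold and the synthetic residuals are all hypothetical.

```python
import math
from statistics import NormalDist

def jarque_bera(residuals):
    """Jarque-Bera statistic and its asymptotic chi-squared(2) p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # The chi-squared survival function with 2 degrees of freedom is exp(-x/2).
    return jb, math.exp(-jb / 2)

def gaussian_aic(residuals, n_mean_params):
    """AIC = 2k - 2 ln L under i.i.d. Gaussian residuals (variance counted)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * (n_mean_params + 1) - 2 * log_lik

def aic_if_normal(residuals, n_mean_params, alpha=0.05):
    """Return the Gaussian AIC, or None if normality is rejected at alpha."""
    _, p_value = jarque_bera(residuals)
    if p_value < alpha:
        return None  # step 2 failed: change the residual model and retest
    return gaussian_aic(residuals, n_mean_params)

# Synthetic residuals (assumed): evenly spaced normal quantiles, and the
# same set with its largest value replaced by a gross outlier.
n = 50
well_behaved = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
with_outlier = well_behaved[:-1] + [10.0]

aic_ok = aic_if_normal(well_behaved, n_mean_params=2)
aic_bad = aic_if_normal(with_outlier, n_mean_params=2)
print(aic_ok, aic_bad)
```

Passing the gate only means the residuals are not inconsistent with the assumed likelihood; as noted above, it says nothing about whether the model is useful for the purpose at hand.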
