
I have seen some questions here about what it means in layman's terms, but those are too informal for my purpose here. I am trying to understand mathematically what the AIC score means.

At the same time, I don't want a rigorous proof that would make me lose sight of the more important points. For example, if this were calculus, I would be happy with infinitesimals, and if this were probability theory, I would be happy without measure theory.

My attempt

By reading here, and with some notational sugar of my own, $\text{AIC}_{m,D}$ is the AIC criterion of model $m$ on dataset $D$, defined as follows: $$ \text{AIC}_{m,D} = 2k_m - 2 \ln(L_{m,D}) $$ where $k_m$ is the number of parameters of model $m$, and $L_{m,D}$ is the maximized value of the likelihood function of model $m$ on dataset $D$.
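To make the definition concrete, here is a minimal sketch (my own construction, not from any referenced source) that fits a Gaussian model by maximum likelihood with SciPy and evaluates the formula above; the dataset and parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=2.0, size=100)   # hypothetical dataset D

# Maximum-likelihood fit of model m (a Gaussian with unknown mu, sigma)
mu_hat, sigma_hat = stats.norm.fit(D)
log_L = stats.norm.logpdf(D, mu_hat, sigma_hat).sum()  # ln(L_{m,D})
k = 2  # parameters of m: mu and sigma

AIC = 2 * k - 2 * log_L
print(f"ln L_(m,D) = {log_L:.2f}, AIC = {AIC:.2f}")
```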

Here is my understanding of what the above implies:

$$ m = \underset{\theta}{\text{arg max}\,} \Pr(D|\theta) $$

This way:

  • $k_m$ is the number of parameters of $m$.
  • $L_{m,D} = \Pr(D|m) = \mathcal{L}(m|D)$.

Let's now rewrite AIC: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2k_m - 2 \ln(\Pr(D|m))\\ =& 2k_m - 2 \log_e(\Pr(D|m))\\ \end{split}$$

Obviously, $\Pr(D|m)$ is the probability of observing dataset $D$ under model $m$. So the better the model $m$ fits the dataset $D$, the larger $\Pr(D|m)$ becomes, and thus the smaller the term $-2\log_e(\Pr(D|m))$ becomes.

So clearly AIC rewards models that fit their datasets (because smaller $\text{AIC}_{m,D}$ is better).

On the other hand, the term $2k_m$ clearly punishes models with more parameters by making $\text{AIC}_{m,D}$ larger.

In other words, AIC seems to be a measure that:

  • Rewards accurate models (those that fit $D$ better) logarithmically. E.g., it rewards an increase in fitness from $0.4$ to $0.5$ more than it rewards an increase from $0.8$ to $0.9$. This is shown in the figure below.
  • Rewards reduction in parameters linearly. So a decrease in parameters from $9$ down to $8$ is rewarded as much as a decrease from $2$ down to $1$.

[Figure: plot of $-2\log_e(\Pr(D|m))$ against $\Pr(D|m)$, illustrating the diminishing reward for better fit.]
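As a quick numeric check of the logarithmic-reward claim above (a sketch of mine, not from the original post): the same $0.1$ gain in $\Pr(D|m)$ shrinks $-2\ln(\Pr(D|m))$ by less when the fit is already good.

```python
import numpy as np

# Drop in -2*ln(p) for two equal-sized improvements in fit
for a, b in [(0.4, 0.5), (0.8, 0.9)]:
    drop = -2 * np.log(a) - (-2 * np.log(b))
    print(f"fit {a} -> {b}: the -2*ln term falls by {drop:.3f}")
# fit 0.4 -> 0.5: the -2*ln term falls by 0.446
# fit 0.8 -> 0.9: the -2*ln term falls by 0.236
```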

In other words (again), AIC defines a trade-off between the importance of simplicity and the importance of fitness.

In other words (again), AIC seems to suggest that:

  • The importance of fitness diminishes.
  • But the importance of simplicity never diminishes; rather, it is always constantly important.

Q1: But a question is: why should we care about this specific fitness-simplicity trade-off?

Q2: Why $2k$ and why $2 \log_e(\ldots)$? Why not just: $$\begin{split} \text{AIC}_{m,D} =& 2k_m - 2 \ln(L_{m,D})\\ =& 2(k_m - \ln(L_{m,D}))\\ \frac{\text{AIC}_{m,D}}{2} =& k_m - \ln(L_{m,D})\\ \text{AIC}_{m,D,\text{SIMPLE}} =& k_m - \ln(L_{m,D})\\ \end{split}$$ i.e. $\text{AIC}_{m,D,\text{SIMPLE}}$ should in my view be equally useful to $\text{AIC}_{m,D}$ and should be able to serve for relatively comparing different models (it's just not scaled by $2$; do we need this?).
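Indeed, any positive rescaling of AIC preserves the ranking of models, which is all that relative comparison needs; a tiny sketch with hypothetical numbers:

```python
import numpy as np

log_L = np.array([-120.3, -118.9, -119.5])  # hypothetical max log-likelihoods
k = np.array([2, 4, 3])                     # hypothetical parameter counts

aic = 2 * k - 2 * log_L
aic_simple = k - log_L                      # = aic / 2

assert np.argmin(aic) == np.argmin(aic_simple)            # same winner
assert (np.argsort(aic) == np.argsort(aic_simple)).all()  # same ordering
```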

Q3: How does this relate to information theory? Could someone derive this from an information theoretical start?

caveman
  • What does your notation in $m=\arg \max_\theta Pr(D|\theta)$ mean? Are you implying something about model choice there? What you had above does not really imply that AIC requires you to choose a model. Q2, as you say, is something pretty arbitrary in some sense, but comes from making AIC an estimate for the Kullback-Leibler divergence, which also relates to the answer for Q1 and gives some meaning to quantities like $\exp((\text{AIC}_m-\min(\text{AIC}_1,\ldots,\text{AIC}_M))/2)$. – Björn Jun 01 '16 at 05:49
  • $\text{arg max}_{\theta} \Pr(D|\theta)$ means keep looking through many $\theta$s until you find one that maximizes the probability $\Pr(D|\theta)$. Each $\theta$ is a tuple/vector of parameters that defines our model that tries to explain dataset $D$. So essentially it says: we have dataset $D$; what is the probability that it was generated by a model parametrized by $\theta$? Our model $m$ is essentially the $\theta$ that solves this maximization problem. – caveman Jun 01 '16 at 06:00
  • Sorry, but are you looking across multiple models (since you write $m=\ldots$), or are you talking about the maximum likelihood estimate $\hat{\theta} := \arg\max_\theta P_\text{given model}(D|\theta)$? Also note $P_\text{given model}(D|\theta)$ is the probability of the data having arisen under the given model and for the given parameters, not the probability that the data was generated by that model parameterized by $\theta$. – Björn Jun 01 '16 at 06:55
  • MLE is what I mean. But I'm just trying to say that the parameters tuple $\theta$ is so comprehensive that it also defines the model. Also I can have multiple models, say $m_1,m_2$ each with a different AIC score $\text{AIC}_1, \text{AIC}_2$. I am just making this notation up because I think it's simpler. Am I being terribly wrong, or unnecessarily confusing this? (and thank you for correcting me on what the MLE means) – caveman Jun 01 '16 at 08:38
  • A derivation of AIC as an approximation to expected K-L information loss is given in Pawitan (2001), *In All Likelihood*, Ch 13. – Scortchi - Reinstate Monica Jun 21 '16 at 09:41
  • As regards Q1, there are other information criteria out there (as you probably know). Of course various justifications for choosing AIC over a competitor can be given. As it picks models with more parameters than the BIC for example, sometimes it holds that the finite-sample problems of picking smaller models outweigh the inefficiencies of more estimated parameters. It very much depends on the context, sometimes bordering on a matter of taste (in my view). – Sven S. Jul 05 '16 at 20:05
  • A normal distribution of residuals is assumed for the derivation of AIC. AIC is largely uncharacterized for non-normality of residuals, for local maxima during maximum likelihood optimization, and with respect to the physicality of goodness-of-fit as an assumption. For particular cases, for example, for finding distributions to fit to histograms, I note that Mathematica v10.3 listed AIC as a selection criterion but used a combined parameter based on likelihood, and since v11 seems to have abandoned that approach entirely in favor of Cramer-von Mises, chi-squared, or other direct p-tests. – Carl Aug 09 '16 at 14:06
  • What about "-0.5 times AIC is model likelihood corrected for overfitting" (so it is a measure of likelihood that you could actually care about, not the plain likelihood which suffers from overfitting and therefore is misleading)? – Richard Hardy Sep 08 '16 at 16:06
  • What does the constant have to do with anything? If $a>b$, then $\tfrac{1}{2}a>\tfrac{1}{2}b$, and the AIC comparison is unchanged. The accommodation for the number of parameters is understandable in a general way; the specifics I cannot comment on. – Carl Sep 13 '16 at 14:43

3 Answers


This question by caveman is popular, but there were no attempted answers for months until my controversial one. It may be that the actual answer below is not, in itself, controversial, merely that the questions are "loaded" questions, because the field seems (to me, at least) to be populated by acolytes of AIC and BIC who would rather use OLS than each other's methods. Please look at all the assumptions listed, and restrictions placed on data types and methods of analysis, and please comment on them; fix this, contribute. Thus far, some very smart people have contributed, so slow progress is being made. I acknowledge contributions by Richard Hardy and GeoMatt22, kind words from Antoni Parellada, and valiant attempts by Cagdas Ozgenc and Ben Ogorek to relate K-L divergence to an actual divergence.

Before we begin, let us review what AIC is. One source for this is Prerequisites for AIC model comparison, and another is from Rob J Hyndman. Specifically, AIC is calculated as

$$2k - 2 \log(L(\theta))\,,$$

where $k$ is the number of parameters in the model and $L(\theta)$ the likelihood function. AIC expresses the trade-off between variance ($2k$) and bias ($-2\log(L(\theta))$) arising from the modelling assumptions. From Facts and fallacies of the AIC, point 3: "The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead." The AIC is a penalized likelihood, whichever likelihood you choose to use. For example, to resolve AIC for Student's-t distributed residuals, we could use the maximum-likelihood solution for Student's-t. The log-likelihood usually applied for AIC is derived from the Gaussian log-likelihood and is given by

$$ \log(L(\theta)) =-\frac{|D|}{2}\log(2\pi) -\frac{1}{2} \log(|K|) -\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu), $$

$K$ being the covariance structure of the model, $|D|$ the sample size (the number of observations in the dataset), $\mu$ the mean response, and $x$ the dependent variable. Note that, strictly speaking, it is unnecessary for AIC to correct for sample size, because AIC is not used to compare datasets, only models fitted to the same dataset. Thus we do not have to investigate whether the sample-size correction is done correctly, but we would have to worry about this if we could somehow generalize AIC to be useful between datasets. Similarly, much is made of requiring $|D| \gg k > 2$ to ensure asymptotic efficiency. A minimalist view might consider AIC to be just an "index," making $|D| > k$ relevant and $|D| \gg k$ irrelevant. However, some attention has been given to this in the form of a proposed altered AIC, called AIC$_c$, for $|D|$ not much larger than $k$; see the second paragraph of the answer to Q2 below. This proliferation of "measures" only reinforces the notion that AIC is an index. However, caution is advised when using the "i" word, as some AIC advocates equate use of the word "index" with the same fondness as might be attached to referring to their ontogeny as extramarital.
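For concreteness, a sketch (mine, with illustrative values) that evaluates this Gaussian log-likelihood directly and checks it against SciPy's multivariate normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5                          # |D|, the number of observations
mu = np.zeros(n)               # mean response
K = 2.0 * np.eye(n)            # covariance structure of the model
x = rng.multivariate_normal(mu, K)   # dependent variable

r = x - mu
log_L = (-0.5 * n * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(K))
         - 0.5 * r @ np.linalg.solve(K, r))

assert np.isclose(log_L, stats.multivariate_normal(mu, K).logpdf(x))
```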

Q1: But a question is: why should we care about this specific fitness-simplicity trade-off?

Answer in two parts. First, the specific question: you should only care because that was the way it was defined. If you prefer, there is no reason not to define a CIC (a caveman information criterion); it will not be AIC, but CIC would produce the same answers as AIC, and it does not affect the trade-off between goodness-of-fit and simplicity. Any constant that could have been used as an AIC multiplier, including one, would have had to be chosen and adhered to, as there is no reference standard to enforce an absolute scale. However, adhering to a standard definition is not arbitrary in the sense that there is room for one and only one definition, or "convention," for a quantity, like AIC, that is defined only on a relative scale. Also see AIC assumption #3, below.

The second answer to this question pertains to the specifics of the AIC trade-off between goodness-of-fit and simplicity, irrespective of how its constant multiplier was chosen. That is, what actually affects the "trade-off"? One of the things that affects it is the degrees-of-freedom adjustment for the number of parameters in a model, which led to defining a "new" AIC, called AIC$_c$, as follows:

$$\begin{align}AIC_c &= AIC + \frac{2k(k + 1)}{n - k - 1}\\ &= \frac{2kn}{n-k-1} - 2 \ln{(L)}\end{align} \,,$$

where $n$ is the sample size. Since the weighting is now slightly different when comparing models having different numbers of parameters, AIC$_c$ selects models differently than AIC itself, and identically to AIC when two models have the same number of parameters. Other methods will also select models differently; for example, "The BIC [the Bayesian information criterion] generally penalizes free parameters more strongly than the Akaike information criterion, though it depends..." ANOVA would also penalize supernumerary parameters, using partial probabilities of the indispensability of parameter values, and in some circumstances would be preferable to AIC. In general, any method of assessing the appropriateness of a model has its advantages and disadvantages. My advice would be to test the performance of any model selection method for its application to the data regression methodology more rigorously than one tests the models themselves, and with good reason: care should be taken when constructing or selecting any model test, so that the methods chosen are methodologically appropriate. AIC is useful for a subset of model evaluations; for that, see Q3, next. For example, extracting information with model A may be best performed with regression method 1, and for model B with regression method 2, where model B and method 2 sometimes yield non-physical answers, where neither regression method is MLR, where the residuals are a multi-period waveform with two distinct frequencies for either model, and the reviewer asks "Why don't you calculate AIC?"
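A small sketch (hypothetical numbers) of how the AIC$_c$ correction above can change a selection that plain AIC leaves tied:

```python
def aic(k, log_L):
    return 2 * k - 2 * log_L

def aicc(k, log_L, n):
    return aic(k, log_L) + 2 * k * (k + 1) / (n - k - 1)

n = 20                                   # small sample
print(aic(3, -25.0), aicc(3, -25.0, n))  # 56.0, 57.5
print(aic(6, -22.0), aicc(6, -22.0, n))  # 56.0, ~62.46
# AIC ties the two models; AICc prefers the 3-parameter one.
```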

Q3: How does this relate to information theory?

MLR assumption #1. AIC is predicated on the applicability of maximum likelihood regression (MLR) to a regression problem. There is only one circumstance in which ordinary least squares regression and maximum likelihood regression have been pointed out to me as being the same. That would be when the residuals from ordinary least squares (OLS) linear regression are normally distributed and MLR has a Gaussian loss function. In other cases of OLS linear regression, for nonlinear OLS regression, and for non-Gaussian loss functions, MLR and OLS may differ. There are many other regression targets than OLS or MLR, or even goodness of fit, and frequently a good answer has little to do with either, e.g., for most inverse problems. There are highly cited attempts (e.g., cited 1100 times) to generalize AIC to quasi-likelihood so that the dependence on maximum likelihood regression is relaxed to admit more general loss functions. Moreover, MLR for Student's-t, although not in closed form, is robustly convergent. Since Student's-t residual distributions are both more common and more general than, as well as inclusive of, Gaussian conditions, I see no special reason to use the Gaussian assumption for AIC.

MLR assumption #2. MLR is an attempt to quantify goodness of fit. It is sometimes applied when it is not appropriate, for example, for trimmed-range data when the model used is not trimmed. Goodness of fit is all well and good if we have complete information coverage. In time series, we do not usually have fast enough information to understand fully what physical events transpire initially, or our models may not be complete enough to examine very early data. Even more troubling is that one often cannot test goodness of fit at very late times, for lack of data. Thus, goodness of fit may only be modelling 30% of the area under the fitted curve, and in that case we are judging an extrapolated model on the basis of where the data is, without examining what that means. In order to extrapolate, we need to look not only at the goodness of fit of the 'amounts' but also at the derivatives of those amounts, failing which we have no "goodness" of extrapolation. Thus, fit techniques like B-splines find use because they can more smoothly predict what the data is when the derivatives are fit, or alternatively inverse-problem treatments, e.g., an ill-posed integral treatment over the whole model range, like error-propagation-adaptive Tikhonov regularization.

Another complicating concern: the data can tell us what we should be doing with it. What we need for goodness of fit (when appropriate) is residuals that are distances, in the sense that a standard deviation is a distance. That is, goodness of fit would not make much sense if a residual that is twice as long as a single standard deviation were not also of length two standard deviations. Selection of data transforms should be investigated before applying any model selection/regression method. If the data has proportional-type error, then taking the logarithm before selecting a regression is typically not inappropriate, as it transforms standard deviations into distances. Alternatively, we can alter the norm to be minimized to accommodate fitting proportional data. The same applies for a Poisson error structure: we can either take the square root of the data to normalize the error, or alter our norm for fitting (see the sketch after this paragraph). There are problems that are much more complicated, or even intractable, if we cannot alter the norm for fitting, e.g., Poisson counting statistics from nuclear decay, where the radionuclide decay introduces an exponential time-based association between the counting data and the actual mass that would have been emanating those counts had there been no decay. Why? If we decay-correct the count rates, we no longer have Poisson statistics, and residuals (or errors) from the square root of corrected counts are no longer distances. If we then want to perform a goodness-of-fit test of decay-corrected data (e.g., AIC), we would have to do it in some way that is unknown to my humble self. Open question to the readership: if we insist on using MLR, can we alter its norm to account for the error type of the data (desirable), or must we always transform the data to allow MLR usage (not as useful)? Note that AIC does not compare regression methods for a single model; it compares different models for the same regression method.
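To illustrate the Poisson case just mentioned, a sketch (my own) showing that the square-root transform makes the residual scale roughly constant, so residuals behave like distances:

```python
import numpy as np

rng = np.random.default_rng(2)
for lam in [4, 25, 100]:
    y = rng.poisson(lam, size=100_000)
    print(f"lambda={lam:3d}  var(y)={y.var():7.2f}  var(sqrt(y))={np.sqrt(y).var():.3f}")
# var(y) grows with lambda, while var(sqrt(y)) stays roughly constant (~0.25).
```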

AIC assumption #1. It would seem that MLR is not restricted to normal residuals; for example, see this question about MLR and Student's-t. Next, let us assume that MLR is appropriate to our problem, so that we can track its use for comparing AIC values in theory. Next we assume that we have 1) complete information, and 2) the same type of distribution of residuals (e.g., both normal, both Student's-t) for at least two models. That is, it is something of an accident that two models should have the same type of distribution of residuals. Could that happen? Yes, probably, but certainly not always.

AIC assumption #2. AIC rests on the Kullback-Leibler divergence: the penalized negative log-likelihood is used as an estimate of (twice) the model-dependent part of the K-L divergence between the fitted model and the truth. Is this assumption necessary? In the general-loss-functions paper a different "divergence" is used. This leads us to ask: if that other measure is more general than the K-L divergence, why are we not also using it for AIC?

The mismatch for AIC inherited from the Kullback-Leibler divergence is that "Although ... often intuited as a way of measuring the distance between probability distributions, the Kullback–Leibler divergence is not a true metric." We shall see why shortly.

The K-L argument gets to the point where the difference between two things, the model ($P$) and the data ($Q$), is

$$D_{\mathrm{KL}}(P\|Q) = \int_X \log\!\left(\frac{{\rm d}P}{{\rm d}Q}\right) \frac{{\rm d}P}{{\rm d}Q} \, {\rm d}Q \,,$$

which we recognize as the entropy of $P$ relative to $Q$.
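A sketch (illustrative distributions of my choosing) computing this for two discrete distributions, and showing the asymmetry that keeps the K-L divergence from being a true metric:

```python
import numpy as np
from scipy.special import rel_entr  # elementwise x * log(x/y)

P = np.array([0.8, 0.1, 0.1])
Q = np.array([0.4, 0.3, 0.3])

kl_pq = rel_entr(P, Q).sum()   # D_KL(P || Q) ~= 0.335
kl_qp = rel_entr(Q, P).sum()   # D_KL(Q || P) ~= 0.382
print(kl_pq, kl_qp)            # unequal: not symmetric, hence not a metric
```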

AIC assumption #3. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm. The constant multiplier might have more meaning if AIC related more than one data set at a time. As it stands, when comparing methods, if $\text{AIC}_{D,m_1} < \text{AIC}_{D,m_2}$, then multiplying both sides by any positive number preserves the inequality. Since the constant is arbitrary, setting it to a specific value as a matter of definition is also not inappropriate.

AIC assumption #4. That would be that AIC measures Shannon entropy, or "self-information." What we need to know is: "Is entropy what we need for a metric of information?"

To understand what "self-information" is, it behooves us to normalize information in a physical context, any one will do. Yes, I want a measure of information to have properties that are physical. So what would that look like in a more general context?

The Gibbs free-energy equation ($\Delta G = \Delta H - T\Delta S$) relates the change in energy to the change in enthalpy minus the absolute temperature times the change in entropy. Temperature is an example of a successful type of normalized information content, because if one hot and one cold brick are placed in contact with each other in a thermally closed environment, then heat will flow between them. Now, if we jump at this without thinking too hard, we say that heat is the information. But it is the relative information that predicts the behaviour of the system. Information flows until equilibrium is reached, but equilibrium of what? Temperature, that's what; not heat as in the particle velocities of certain particle masses. I am not talking about molecular temperature; I am talking about the gross temperature of two bricks, which may have different masses, be made of different materials, have different densities, etc., and none of that do I have to know: all I need to know is that the gross temperature is what equilibrates. Thus, if one brick is hotter, then it has more relative information content, and when colder, less.

Now, if I am told one brick has more entropy than the other, so what? That, by itself, will not predict whether it will gain or lose entropy when placed in contact with another brick. So, is entropy alone a useful measure of information? Yes, but only if we are comparing the same brick to itself, hence the term "self-information."

From that comes the last restriction: to use K-L divergence, all bricks must be identical. Thus, what makes AIC an atypical index is that it is not portable between data sets (e.g., different bricks); this is not an especially desirable property, and it might be addressed by normalizing information content. Is K-L divergence linear? Maybe yes, maybe no. However, that does not matter: we do not need to assume linearity to use AIC, and, for example, I do not think entropy itself is linearly related to temperature. In other words, we do not need a linear metric to use entropy calculations.

One good source of information on AIC is this thesis. On the pessimistic side, it says, "In itself, the value of the AIC for a given data set has no meaning." On the optimistic side, it says that models that have close results can be differentiated by smoothing to establish confidence intervals, and much, much more.

Carl
  • Could you indicate the main difference between the new answer and the old deleted answer? It seems there is quite some overlap. – Richard Hardy Sep 13 '16 at 11:51
  • I was in the middle of editing my answer for some hours when it was deleted. There were a lot of changes compared to when I started as it was a work in-progress, took a lot of reading and thinking, and my colleagues on this site do not seem to care for it, but are not helping answer anything. AIC it seems is too good for critical review, how dare I? I completed my edit and re-posted it. I want to know what is incorrect about my answer. I worked hard on it, and have tried to be truthful, and, no-one else has bothered. – Carl Sep 13 '16 at 14:44
  • Don't get upset. My first experience here was also frustrating, but later I learned to ask questions in an appropriate way. Keeping a neutral tone and avoiding strong opinions that are not based on hard facts would be a good first step, IMHO. (I have upvoted your question, by the way, but still hesitate about the answer.) – Richard Hardy Sep 13 '16 at 14:53
  • Finally, something helpful; I will try for neutrality. My emotional reaction is unjustified; I should know better than to react when provoked. BTW, I am also trying to show how to create an index or measure that would be more generally useful, and that, it seems, is lost in translation. – Carl Sep 13 '16 at 15:02
  • +1 Just for your preamble. Now I'll keep on reading the answer. – Antoni Parellada Sep 13 '16 at 22:12
  • @AntoniParellada You have helped just by keeping the question from being deleted, which I appreciate. Working thru AIC has been difficult, and I do need help with it. Sure some of my insights are good, but I also have hoof in mouth disease, which other minds are better at catching than I. – Carl Sep 13 '16 at 22:34
  • *The log-likelihood applied for AIC is Gaussian* -- why do you think so? It can be any likelihood, it depends on the assumed distribution of the data. You also say $k$ is the number of fixed effects in one place and the number of parameters in another case. I think the former could be used everywhere for consistency. – Richard Hardy Sep 21 '16 at 13:49
  • @RichardHardy Please look carefully at the equation. The AIC log-likelihood equation comes from the logarithm of a normal distribution. Question back to you: what is the difference between saying "number of fixed effects" and "number of parameters," and why do you prefer the former? – Carl Sep 21 '16 at 15:55
  • @RichardHardy I put in links to both the equation and to its derivation from log-likelihood for a normal distribution. I use "number of parameters" inside of a quote, and got "number of fixed effects" from a quote. When I use either it is for consistency with the source material. I am writing defensively here, not necessarily well, but I still do not see the difference between the two other than style. Perhaps "number of fixed effects" would be better statistical jargon and "number of parameters" better mathematical physics jargon. I'm guessing, and want U'r response. – Carl Sep 21 '16 at 16:27
  • It is simply not true that AIC always uses the normal likelihood. See Rob J. Hyndman's post ["Facts and fallacies of the AIC"](http://robjhyndman.com/hyndsight/aic/), point 3. Regarding *fixed effects* versus *parameters*, perhaps it is a matter of taste, but *fixed effects* are more commonly used in panel data models where they have specific meaning, while *parameter* is the general term that seems to be the suitable one here; but you may disagree, I am not the ultimate expert on this. – Richard Hardy Sep 21 '16 at 17:21
  • @RichardHardy Points taken (+2). Point 3 included in text. Link to other maximum likelihood functions put in. Use of the word "parameters" now exclusive. Next question: I would usually say samples, as in "time-samples," rather than "points" when describing a single $n$-dimensional data object. However, sample tends to be a collective noun in statistical parlance. Is there a noun more elegant than "points" to describe these line data entries? BTW, much thanks; at the end of the day, I might actually understand what AIC is. It is not a trivial accomplishment, that. – Carl Sep 21 '16 at 18:04
  • Sample points, data points, observations. (I am not entirely sure I understood what word you actually need here.) – Richard Hardy Sep 21 '16 at 18:07
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/45711/discussion-between-carl-and-richard-hardy). – Carl Sep 21 '16 at 18:48
  • You mention a need for *the same distribution of residuals for at least 2 models*. Why would you need that? Why can't you compare AICs of models with different distributional assumptions? – Richard Hardy Oct 18 '16 at 08:29
  • @RichardHardy The same type of distribution. Theoretically, I suppose one could solve different maximum likelihood solutions for differing assumptions on the same data with two different models. For example, Student's-t MLR and normal MLR, and since Student's-t includes normal distributions in its spectrum, that would work. For [gamma distribution MLR](http://math.stackexchange.com/questions/310487/likelihood-function-of-a-gamma-distributed-sample), and normal MLR residuals, it might work. But, I think we are missing the elephant in the room. Most people just use ND MLR and do not even check. – Carl Oct 18 '16 at 17:05
  • What about using distributions from different families? Could that be a problem in AIC comparisons? – Richard Hardy Oct 18 '16 at 17:10
  • @RichardHardy. Consider the gamma distribution (GD) and ND, for the GD to converge to an ND, one would use $x-u$ on the GD $x$-axis, then unavoidably $u$ goes to infinity. Maybe this is problematic, maybe it is irrelevant. To be honest, I don't know. It may merit a question all of its own. – Carl Oct 18 '16 at 17:21
  • Maybe. I don't see why AIC comparison would require the distributions to be somehow related. – Richard Hardy Oct 18 '16 at 17:31
  • I see a potential problem comparing bounded, semi-infinite support, infinite support, finite support, and [other](https://en.wikipedia.org/wiki/List_of_probability_distributions) PDFs to each other. However, I plead ignorance. It is possible to compare some distributions by using inclusive MLR of the same type. It may be possible to compare some MLR of different types. It would be risky to assume that all MLR of different boundings can be properly compared. – Carl Oct 18 '16 at 17:45
  • It is revealing to look at Wikipedia's example of how to [compare AIC-values from normal residuals with log normal residuals](https://en.wikipedia.org/wiki/Akaike_information_criterion#Transforming_data). They transform the data so that the comparison is done within the log-normal context. Transformed data and untransformed data do not plausibly have the same entropy. – Carl Oct 18 '16 at 18:17
  • Just to thank you for your amazing diligence to nail this. Also thank you for re-posting your answer again after it got deleted. I am glad that I am able to read your answer. Sorry for my absence as I'm very pressed lately. But I will read this and, once I read it carefully and understand it, I'll probably end up accepting it as an answer. – caveman Dec 06 '16 at 05:52
  • It was a really good question, and a personal challenge to get it together. It is actually the reason I joined CV (+1). – Carl Dec 06 '16 at 06:02
  • @RichardHardy Let me clarify. Different residual distributions properly require different MLR algorithms. I think that one can compare them only if one finds a common distribution family for both; one example is normal versus Student's-t, where both can be modeled as Student's-t distributions. – Carl Feb 22 '17 at 16:58
  • @caveman If you would, please mention what, if any, outstanding issues there are preventing you from accepting the answer, and I will do what I can to address them. I am still getting occasional un-commented downvotes, and it is annoying. – Carl Feb 22 '17 at 17:02
  • @Carl, thanks for pinging me. I am not aware of hard statistical arguments to support the statement in your last comment (while your intuitive argument does not ring a bell for me either). AFAIK it does not hold, but I cannot prove or disprove it myself, thus I am relying on my previous knowledge from academic papers and answers elsewhere on this site. Sorry I could not help. – Richard Hardy Feb 22 '17 at 17:21
  • @RichardHardy From [Facts and fallacies of the AIC](http://robjhyndman.com/hyndsight/aic/), point 3 by Rob J. Hyndman "The AIC (Sic, thus MLR) does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead." Note the AIC derivation above is only for Gaussian conditions. Link to [maximum likelihood for the t-distribution](http://stats.stackexchange.com/a/63861/99274). Anything outstanding we could ask Rob to look at, he is on this site. – Carl Feb 22 '17 at 18:16
  • @Carl The only outstanding issue is that I am too busy lately to revisit your fun-to-read post. I am extremely thankful to you for reposting your answer again after it was deleted. I was the first to up-vote your 2nd answer (sadly I didn't see your 1st answer) to ensure that it wouldn't be deleted again. This question is certainly in my list of to-do as soon as I get free (in 2 months maybe) – caveman Feb 23 '17 at 18:55
  • Rereading your answer, I find the use of MLR for "maximum likelihood" a little disturbing. The conventional acronyms are ML for "maximum likelihood" and MLE for "ML estimator". On the other hand, MLR is used for "multiple linear regression" e.g. in Wooldridge's introductory econometrics textbook. But if you insist on MLR for "maximum likelihood", then I suggest making this explicit the first time you use the acronym. Also, phrasing some statements in terms of "estimators" rather than "regression" (such as "maximum likelihood **estimator**" [of a linear model] instead of MLR) could help. – Richard Hardy Feb 25 '17 at 18:27
  • "There are many other types of regression than just OLS and MLR..." I think this could be more usefully rephrased in terms of models and estimators, because an "OLS regression" is a vague thing. It is probably supposed to mean a linear model that can be estimated using OLS and given the assumptions under which the OLS estimator has nice properties, isn't it? But this is not self-evident from the way you phrase it; one can also use OLS when it is not appropriate as long as the estimator is feasible. Etc etc. In any case, you have composed quite an opus by now :) – Richard Hardy Feb 25 '17 at 18:39
  • @RichardHardy No, I am referring to different regression targets. For any regression model, the regression target method could be a very large set. For me, OLS is not ambiguous, how do you see it as ambiguous? I changed this to "other regression targets than OLS or MLR or even goodness of fit" from "types of regression" to underline this. In other words, there is more to this than L1 and L2, lots more. – Carl Feb 26 '17 at 02:30
  • @Carl, "regression target" is again not a statistical term. What do you mean by it? Let me guess: the target could be to estimate a particular parameter or a few of them; to forecast $y_i$ for a given $X_i$ (conditionally) or withouth knowing $X_i$ in advance (unconditionally). All of that could be evaluated in terms of different loss functions, e.g. square loss, absolute loss, etc. – Richard Hardy Feb 26 '17 at 06:59
  • @RichardHardy For $y=f(x)$, a regression target may be some $\text{Min}\,\|g(x)\|$. – Carl Feb 26 '17 at 16:37
  • Thank you for the clarification. I suppose that should involve $y$ as well. For now it seems you are using *target* to describe an objective function or a loss function that defines an estimator. – Richard Hardy Feb 26 '17 at 16:49
  • @RichardHardy Yes, more generally $\text{Min}||g(x,y)||_{L_{z}}$; a minimization of an objective function of an $L_z$ norm, presumably chosen for some explicit purpose, rather than a loss function related to curve fitting per se. Curve fitting is sometimes irrelevant to a proper inverse solution. – Carl Feb 26 '17 at 20:43

AIC is an estimate of twice the model-driven additive term to the expected Kullback-Leibler divergence between the true distribution $f$ and the approximating parametric model $g$.

K-L divergence is a topic in information theory and works intuitively (though not rigorously) as a measure of distance between two probability distributions. In my explanation below, I'm referencing these slides from Shuhua Hu. This answer still needs a citation for the "key result."

The K-L divergence between the true model $f$ and approximating model $g_{\theta}$ is $$ d(f, g_{\theta}) = \int f(x) \log(f(x)) dx -\int f(x) \log(g_{\theta}(x)) dx$$

Since the truth is unknown, data $y$ is generated from $f$ and maximum likelihood estimation yields the estimator $\hat{\theta}(y)$. Replacing $\theta$ with $\hat{\theta}(y)$ in the equations above means that both the second term in the K-L divergence formula and the K-L divergence itself are now random variables. The "key result" in the slides is that the expectation over $y$ of the second additive term can be estimated by a simple function of the likelihood function $L$ (evaluated at the MLE) and $k$, the dimension of $\theta$: $$ -\text{E}_y\left[\int f(x) \log(g_{\hat{\theta}(y)}(x)) \, dx \right] \approx -\log(L(\hat{\theta}(y))) + k.$$

AIC is defined as twice the expectation above (HT @Carl), and smaller (more negative) values correspond to smaller estimated K-L divergences between the true distribution $f$ and the modeled distribution $g_{\hat{\theta}(y)}$.
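A Monte Carlo sketch of the "key result" (my construction, not from the slides): for a Gaussian model with $k = 2$, the maximized in-sample log-likelihood exceeds the out-of-sample log-likelihood of fresh data by roughly $k$ on average, which is exactly the bias the $+k$ term corrects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 50, 2000
gaps = []
for _ in range(reps):
    y = rng.normal(0.0, 1.0, n)            # data from the truth f
    mu, sd = stats.norm.fit(y)             # MLE theta_hat(y)
    in_sample = stats.norm.logpdf(y, mu, sd).sum()
    y_new = rng.normal(0.0, 1.0, n)        # fresh draw from f
    out_sample = stats.norm.logpdf(y_new, mu, sd).sum()
    gaps.append(in_sample - out_sample)
print(np.mean(gaps))   # close to k = 2
```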

Ben Ogorek
  • As you know, the term [deviance](https://en.wikipedia.org/wiki/Deviance_(statistics)) when applied to log-likelihood is jargon and inexact. I omitted discussion of this because only monotonicity is required for AIC differences to have comparative worth not linearity. So, I fail to see the relevance of trying overly hard to "visualize" something that likely is not there, and not needed anyway. – Carl Sep 21 '16 at 17:10
  • I see your point that the last paragraph adds a red herring, and I realize that nobody needs to be convinced that $2x$ ranks the same as $x$. Would it be fair to say that the quantity is multiplied by 2 "by convention"? – Ben Ogorek Sep 21 '16 at 23:16
  • Something like that. Personally, I would vote for "is defined as," because it was initially chosen that way. Or to put this in temporal perspective, any constant that could have been used, including one times, would have to have been chosen and adhered to, as there is no reference standard to enforce a scale. – Carl Sep 21 '16 at 23:57

A simple point of view for your first two questions is that the AIC is related to the expected out-of-sample error rate of the maximum likelihood model. The AIC criterion is based on the relationship (Elements of Statistical Learning equation 7.27) $$ -2 \, \mathrm{E}[\ln \mathrm{Pr}(D|\theta)] \approx -\frac{2}{N} \, \mathrm{E}[\ln L_{m,D}] + \frac{2k_m}{N} = \frac{1}{N} E[\mathrm{AIC}_{m,D}] $$ where, following your notation, $k_m$ is the number of parameters in the model $m$ whose maximum likelihood value is $L_{m,D}$.

The term on the left is the expected out-of-sample "error" rate of the maximum likelihood model $m = \{ \theta \}$, using the log of the probability as the error metric. The $-2$ factor is the traditional correction used to construct the deviance (useful because in certain situations it follows a chi-squared distribution).

The right hand consists of the in-sample "error" rate estimated from the maximized log-likelihood, plus the term $2k_m/N$ correcting for the optimism of the maximized log-likelihood, which has the freedom to overfit the data somewhat.

Thus, the AIC is an estimate of the out-of-sample "error" rate (deviance) times $N$.
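A simulation sketch (mine, under a simple Gaussian setup) of the relation above: $\text{AIC}/N$ tracks the expected per-observation out-of-sample deviance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, reps, k = 50, 2000, 2
aic_over_N, oos_dev = [], []
for _ in range(reps):
    y = rng.normal(0.0, 1.0, N)
    mu, sd = stats.norm.fit(y)                    # maximum likelihood fit
    log_L = stats.norm.logpdf(y, mu, sd).sum()
    aic_over_N.append((2 * k - 2 * log_L) / N)
    x_new = rng.normal(0.0, 1.0, 10_000)          # fresh draws from the truth
    oos_dev.append(-2 * stats.norm.logpdf(x_new, mu, sd).mean())
print(np.mean(aic_over_N), np.mean(oos_dev))      # approximately equal
```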

jwimberley