
In the Wikipedia entry for Akaike information criterion, we read under Comparison with BIC (Bayesian information criterion) that

...AIC/AICc has theoretical advantages over BIC...AIC/AICc is derived from principles of information; BIC is not...BIC has a prior of 1/R (where R is the number of candidate models), which is "not sensible"...AICc tends to have practical/performance advantages over BIC...AIC is asymptotically optimal...BIC is not asymptotically optimal...the rate at which AIC converges to the optimum is...the best possible.

On the AIC article's talk page, there are numerous comments about the biased presentation of the comparison-with-BIC section. One frustrated contributor protested that the whole article "reads like a commercial for cigarettes."

In other sources, for example in this thesis appendix, the tenor of the claims for AIC seems more realistic. Thus, as a service to the community, we ask:

Q: Are there circumstances in which BIC is useful and AIC is not?

Carl

3 Answers


According to Wikipedia, the AIC can be written as follows: $$ \mathrm{AIC} = 2k - 2 \ln(\mathcal L) $$ The BIC can be written as follows: $$ \mathrm{BIC} = -2 \ln(\mathcal L) + k \ln(n) $$ So the difference is that the BIC's penalty on the $k$ parameters grows with the sample size $n$ (for $n \geq 8$, $\ln(n) > 2$). Because the BIC penalizes complex models more heavily, there are situations in which the AIC will hint that you should select a model that is too complex, while the BIC is still useful. If you do not want the penalty to depend on the sample size, the AIC may be preferable.
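To make the $2k$ versus $k\ln(n)$ difference concrete, here is a minimal Python sketch (the function names are illustrative, not from any library) that computes AIC, the small-sample-corrected AICc, and BIC from a maximized log-likelihood:

```python
import numpy as np

def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik

def aicc(log_lik: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; undefined when n <= k + 1."""
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * np.log(n) - 2 * log_lik

# For n >= 8 we have ln(n) > 2, so BIC charges more per parameter
# than AIC and tends to select the smaller model.
print(aic(-100.0, 5), bic(-100.0, 5, 50))  # 210.0 vs ~219.6
```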

A quick explanation by Rob Hyndman can be found here: Is there any reason to prefer the AIC or BIC over the other? He writes:

  • **AIC is best for prediction as it is asymptotically equivalent to cross-validation.**
  • **BIC is best for explanation as it allows consistent estimation of the underlying data generating process.**

Edit: One example can be found in time series analysis. In VAR models, the AIC (as well as its corrected version, the AICc) often selects too many lags. Therefore one should primarily look at the BIC when choosing the number of lags of a VAR model; a sketch is given below. For further information you can read Section 9.2 of Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos.
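As a hedged illustration of that advice, the sketch below simulates a bivariate VAR(1) and asks statsmodels to rank lag lengths (the exact `select_order` output format depends on your statsmodels version); AIC will typically choose at least as many lags as BIC:

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 200
A = np.array([[0.5, 0.1],
              [0.0, 0.4]])          # true VAR(1) coefficient matrix
y = np.zeros((n, 2))
for t in range(1, n):
    y[t] = A @ y[t - 1] + rng.normal(size=2)

order = VAR(y).select_order(maxlags=8)
print(order.summary())            # AIC, BIC, HQIC, FPE for each lag length
print(order.selected_orders)      # e.g. {'aic': ..., 'bic': ...}; BIC's pick <= AIC's as a rule
```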

Ferdi
  • Can you add more, please? In particular, cannot BIC be used to converge on an appropriate prior, given that it is post hoc? I appreciate the answer, thanks. Strangely, this non-temporal concept of "prediction" seems limited to predicting only in the sense of interpolating values from a nearly identical range of withheld values. Usually the word "prediction" would apply to extrapolation beyond the range of an observed time series, which is not what either cross-validation or AIC is especially good at. Maybe the term "predicted interpolation" should be used. – Carl Sep 30 '16 at 15:00
  • The bold text is a one-to-one citation from Rob Hyndman, who is a famous statistics professor from Australia. I think by "prediction" he means "inference". So the AIC would be more useful for inferential statistics, while the BIC would be more useful for descriptive statistics. – Ferdi Oct 01 '16 at 12:39
  • Yes, prolific as well. Still, what I am asking for is one good example of what AIC cannot do that BIC can. – Carl Oct 01 '16 at 15:40
  • @Ferdi, no, definitely "prediction" does not mean "inference" in that blog post. "Prediction" is "prediction", or "forecasting", where you don't care whether your model is "correct" (in some sense) as long as it forecasts well. Following that post, it seems that BIC is the preferred one for inference. – Richard Hardy Nov 28 '16 at 09:50
  • Thank you for your reply. Prediction or forecasting is "inferring" from observed data to non-observed data. – Ferdi Dec 01 '16 at 09:40
  • @RichardHardy I think in general you might mean AIC is good for some aspect of predictive interpolation, perhaps some measure of self-consistency, although frankly leave-one-out, the type of cross-validation that applies, is less than stellar for parameter estimation. – Carl Dec 12 '16 at 23:28
  • @Carl, thanks for the comment. I meant exactly what I wrote there. But feel free to have a different opinion on the subject (though not on what I was actually thinking). Also, an example of what AIC cannot do that BIC can: select the true model with probability one (consistent model selection), under appropriate assumptions. – Richard Hardy Dec 13 '16 at 13:30
  • @RichardHardy This is a difficult topic, that I want to understand. Prediction of what? Which appropriate assumptions? Any hint appreciated. – Carl Dec 14 '16 at 18:29
  • @Carl, I am like you in this respect; I find it difficult myself and I am eager to learn more. By *prediction* I mean the point forecast of $y$ given $X$ (from a model that specifies the conditional distribution $P_Y(y|X)$). I do not have handy a list of assumptions for BIC being a consistent model selection criterion, but they are certainly available in textbooks and research papers dealing with the topic. I would try Burnham & Anderson “Multimodel inference: Understanding AIC and BIC in Model Selection” (2004) and Claeskens "Statistical model choice” (2016) for starters (great papers, IMHO). – Richard Hardy Dec 14 '16 at 19:55
  • @RichardHardy Then you must allow that AIC is biased; that is, a minimum-error point forecast of $y$ given $X$ is not a best functional model and will not recover the model under Monte Carlo simulation, but is a best interpolation (splitting the difference of the $y-f(x)$ error over the range of the data). If BIC can, then it is finding a best $y=f(x)$, which is not a best interpolation but which is actually better for extrapolation. – Carl Dec 14 '16 at 21:07
  • @Carl, This is addressed in Burnham & Anderson; I sincerely recommend taking a look. – Richard Hardy Dec 14 '16 at 21:21
  • @RichardHardy OK, obtained a copy and will read, but that could take a while. Also, I have seen criticism of that book, so will take with a grain of salt. – Carl Dec 15 '16 at 03:01
  • You mentioned time series. I have my doubts that maximum likelihood (ML) is applicable to time series without fear of contradiction as the coverage is not complete (i.e., $t_{max}\neq \infty$). I can understand applicability of ML to random variates (RV), but time series are not RV. Please explain. – Carl Jan 31 '19 at 22:47

It is not meaningful to ask whether AIC is better than BIC. Even though these two model selection criteria look superficially similar, they were each designed to solve fundamentally different problems. So you should choose the model selection criterion that is appropriate for the problem you have.

AIC is a formula that estimates the expected value of twice the negative log-likelihood of test data under a correctly specified probability model whose parameters were obtained by fitting the model to training data. That is, AIC estimates expected cross-validation error using negative log-likelihood loss: $$ \mathrm{AIC} \approx E\left\{-2 \log \prod_{i=1}^n p(x_i \mid \hat{\theta}_n)\right\}, $$ where $x_1, \ldots, x_n$ are test data, $\hat{\theta}_n$ is estimated from training data, and $E\{\cdot\}$ denotes the expectation operator with respect to the iid data-generating process that produced both the training and test data.
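That expectation can be checked by simulation. The sketch below (a toy check assuming iid Gaussian data and a correctly specified model, not a general proof) fits a normal distribution to training samples and compares the average AIC to the average deviance on fresh test samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, reps = 100, 2, 2000          # sample size, parameters (mean, sd), replications
aics, test_dev = [], []
for _ in range(reps):
    train = rng.normal(loc=1.0, scale=2.0, size=n)
    test = rng.normal(loc=1.0, scale=2.0, size=n)
    mu, sd = train.mean(), train.std()             # ML estimates from training data
    ll_train = stats.norm.logpdf(train, mu, sd).sum()
    aics.append(2 * k - 2 * ll_train)
    test_dev.append(-2 * stats.norm.logpdf(test, mu, sd).sum())

# The two averages should agree closely, illustrating AIC as an
# estimate of expected out-of-sample deviance.
print(np.mean(aics), np.mean(test_dev))
```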

BIC, on the other hand, is not designed to estimate cross-validation error. BIC estimates twice the negative logarithm of the likelihood of the observed data given the model. This likelihood is also called the marginal likelihood; it is computed by integrating the likelihood function, weighted by a parameter prior $p(\theta)$, over the parameter space. That is, $$ \mathrm{BIC} \approx -2 \log \int \left[\prod_{i=1}^n p(x_i \mid \theta)\right] p(\theta)\,d\theta. $$
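For a conjugate toy model this integral has a closed form, so the approximation can be checked directly. The sketch below (my own illustration, assuming a $N(\theta, 1)$ likelihood with known variance and a $N(0, 10^2)$ prior on $\theta$) compares BIC to the exact $-2 \log$ marginal likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
x = rng.normal(loc=0.7, scale=1.0, size=n)

# BIC with k = 1 free parameter (theta), likelihood evaluated at the MLE x-bar
ll_hat = stats.norm.logpdf(x, x.mean(), 1.0).sum()
bic = 1 * np.log(n) - 2 * ll_hat

# Exact marginal likelihood: integrating out theta ~ N(0, tau^2) gives
# x ~ N(0, I + tau^2 * 11'), a multivariate normal
tau = 10.0
cov = np.eye(n) + tau**2 * np.ones((n, n))
log_marginal = stats.multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(x)

# The two quantities differ only by a term that stays O(1) as n grows.
print(bic, -2 * log_marginal)
```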

RMG
  • Some proponents of AIC versus BIC are so enamored of their opinions that they remind me of Democrats versus Republicans in the US. The question posited is a practical one, as these armed camps often review scientific journal articles, and indeed a more relevant question is whether maximum likelihood is appropriate at all in the circumstances in which it tends to be applied. – Carl Jan 31 '19 at 22:53
  • BTW, (+1) for contributing to the discussion. I would like to see more about whether either AIC or BIC is applicable in the circumstances in which they tend to be used, but that is, admittedly, a separate question. – Carl Jan 31 '19 at 23:15

Q: Are there circumstances in which BIC is useful and AIC is not?

A: Yes. BIC and AIC have fundamentally different goals. BIC estimates the probability that a model minimizes the loss function (specifically, the Kullback-Leibler divergence); a BIC difference of 0.1 between models A and B implies that model A is roughly 5% more likely to be the best model (since $e^{0.1/2} \approx 1.05$), assuming you start with close to no information and have a large sample size. AIC, by contrast, measures how good a model is at making predictions: an AIC difference of 0.1 means (very roughly) that model A will be about 5% better at making new predictions than model B.
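A minimal sketch of this arithmetic (my own illustration, not from the original answer): criterion differences convert to model weights via $\exp(-\Delta/2)$, normalized across the candidate set. These are Akaike weights when fed AIC values, and approximate posterior model probabilities when fed BIC values.

```python
import numpy as np

def ic_weights(ic_values):
    """Normalize exp(-delta/2), where delta = IC - min(IC)."""
    ic = np.asarray(ic_values, dtype=float)
    rel = np.exp(-(ic - ic.min()) / 2.0)
    return rel / rel.sum()

# A 0.1 gap: the better model is exp(0.05) ~ 1.05 times as likely, i.e. ~5% more.
print(ic_weights([100.0, 100.1]))   # -> [0.5125..., 0.4874...]
```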

This means that BIC can be better if you want to know the probability that a model is true. AIC can't give you that; if you try using AIC in this way, you get an inconsistent procedure (i.e., AIC will not always select the true model, even as the sample size grows).

On the other hand, AIC will be better at minimizing the expected loss.

In short, AIC and BIC have two fundamentally different goals: BIC tries to maximize the chances of picking the best model, while AIC tries to maximize the expected quality of the model you select.

  • Yeah, OK, but I see a problem with AIC and BIC as well as R$^2$ and other measures that seems to have slipped under the radar. Briefly, as a physical analogy, self-entropy is not a measure of the environment in which a hypothetical "brick" finds itself. The correct measurement would be temperature. Mathematically, one cannot correct a zero by multiplying it by some constant. For example, I can always find a polynomial that fits any 2D data set perfectly. That would give an R$^2$ of 1, an unexplained variance of 0, and BIC and AIC values that are better than any model having fewer parameters. – Carl Oct 22 '21 at 20:45
  • cont... The problem is that of overfitting. In other words, a perfect match between model and data is one that has no modelling error, but it can still have noise, and one can generally assume that total error (as variance) is the sum of modelling error and data noise. Thus, a hypothetically perfectly matched physical model would have no modelling error, but neither would it have an R$^2$ of 1 nor necessarily the ideal AIC and BIC values. Sure, sometimes one man's noise is another man's more complete model, but that still does not discount the overfitting problem. – Carl Oct 22 '21 at 20:45
  • P.S. Said more simply, least self-entropy is not necessarily the most appropriate self-entropy for the context of a problem. +1 for your answer. – Carl Oct 22 '21 at 21:17
  • @Carl I don't know about the physical analogy (I am not a physicist), but I think what you're saying about AIC/BIC having problems is correct. AIC and BIC are approximations that work in the limit with infinite data and few parameters (the number of data points per parameter has to be large for the approximation to hold). AICc partially corrects for this by adding another term, so it works better in smaller samples, but a perfect calculation would require an infinitely long series. – Closed Limelike Curves Oct 23 '21 at 19:24
  • BTW, the usual calculation for R^2 uses a correction for the number of explanatory variables. This correction makes R^2 undefined when you have as many parameters as data points (as in your polynomial example). [Link](https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2) – Closed Limelike Curves Oct 23 '21 at 19:31
  • For adjusted-R$^2$ to mean something, R$^2=1$ would have to mean something. As R$^2=1$ can be produced in multiple ways, it does not distinguish between those models, so adjusting it is also futile. The problem is that a perfect model fit to noisy data neither has R$^2=1$ nor can it be shown to be perfect by minimizing a KL loss function, because overfitting will yield a lower KL loss. This happens notably in simulations as the degrees of freedom decrease without ever getting to 0. – Carl Oct 23 '21 at 20:23
  • @Carl You're right that adjusted R^2 doesn't mean anything when R^2 equals 1 -- because adjusted R^2 is undefined when you get a perfect fit from k >= n-1 parameters. If you get a perfect fit with fewer than n-1 parameters, that's not just interpolation anymore, it's a valid result (assuming it's not a post-hoc hypothesis). With regard to the KL loss function, it depends on what estimator you're using for KL loss. If you use AICc, a perfect fit with k=n-1 parameters is [undefined](https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_small_sample_size). – Closed Limelike Curves Oct 24 '21 at 02:18
  • Actually, it's worse than that. I wouldn't expect R$^2$ (adjusted or not), AICc, AIC, or BIC to work for $k=n-2$ either, because of routine overfitting of part of the noise in the data. Moreover, *df* can be fractional, so one can get very close to, but less than, $k=n-1$, and things fall apart before that. Example: suppose we have a model and know (from assay) that the error is 5% (e.g., rmse). We fit our model and find two results, error 3% and error 6%. For the model with the 3% error we have $n-1=10$ and twelve 2D data pairs. For the other, 6%-error model we have $n-1=4$... – Carl Oct 24 '21 at 05:04
  • cont... for the same data pairs. Which model does one trust? If we truly have a 5% error of measurement, we cannot produce a correct model with less error than that because it would mean that our modelling error had a negative variance. One can certainly produce results like that, but they are not real. Now none of this is apparent when R$^2$, AIC etc. are used. – Carl Oct 24 '21 at 05:10
  • All this may be as clear as mud. Let's say it this way. Given that there is a proliferation of [different methods of adjusting R$^2$](https://stats.stackexchange.com/q/508268/99274), one suspects that no one of them is entirely satisfactory. Ditto for AIC, AICc, BIC, and a half-dozen others. – Carl Oct 24 '21 at 07:44
  • @Carl yes, you're right that all of these are approximations that start to break down for large numbers of parameters. That being said, the differences are generally small in practice. – Closed Limelike Curves Oct 24 '21 at 16:05
  • If you want to deal with this in a completely principled way that takes into account fractional degrees of freedom and works even for very small sample sizes, you can use leave-one-out cross-validation instead -- AIC and AICc are both asymptotic approximations of leave-one-out cross-validation using K-L divergence loss. WAIC handles fractional degrees of freedom directly, but breaks down for small samples just as much as AIC does. – Closed Limelike Curves Oct 24 '21 at 16:08
  • And are not useful for comparing different regressions methods, non-nested models, etc. Too many ifs for my tastes. – Carl Oct 25 '21 at 10:50
  • @Carl You can absolutely use AIC, WAIC, BIC, etc. to compare different regression methods and non-nested models, although there's a popular misconception that you can't. Pretty much the only assumption you're likely to break in practice is IID observations (which you also need for cross-validation). The common misconception that you need nested models comes from the fact that you need these for statistical significance tests, but AIC/BIC/etc. are not statistical significance tests. – Closed Limelike Curves Oct 26 '21 at 23:16
  • Well, maybe sometimes you can, but sometimes not. Here is a [counter example](https://stats.stackexchange.com/a/369851/99274). – Carl Oct 28 '21 at 03:45
  • Another quibble [here](https://stats.stackexchange.com/a/376064/99274). In particular, relying on asymptotic correctness is problematic for $n<100$ according to some authors, and even for $n<200$ according to others. Moreover, your comment about different regression methods is problematic, as the only method for which AIC, BIC, and the like are applicable is maximum likelihood, which for ordinary least squares is appropriate only in the linear case, and only if the residuals are maximum-likelihood modeled using the appropriate residual model, e.g., normal, Student's t, etc. – Carl Oct 28 '21 at 04:29
  • cont... to the data set under consideration. I really don't see how such a model can be easily applied to weighted least squares non-linear regression methods with heteroscedastic residuals. Much easier to create simpler methods of examining goodness of fit, I would venture. – Carl Oct 28 '21 at 04:33
  • @Carl You can definitely compare nonlinear, weighted least squares heteroscedastic models using AIC -- a model that doesn't have heteroscedasticity included will usually be a poor fit to heteroscedastic data, and this will show up in the log-likelihood term of AIC. The WLS estimator is just a maximum likelihood estimator that assumes the variables follow Gaussian distributions with different variances. – Closed Limelike Curves Nov 01 '21 at 01:29
  • You're right that AIC+BIC require pretty large sample sizes to be accurate, which is why it's pretty much always better to use AICc. If you want to compare models trained using different sample sizes, you need to divide the AIC by the sample size n before comparing models, which makes AIC into a scale-free estimator of the K-L divergence. That being said, this will make AIC much noisier, and you can't use it to generate Akaike weights anymore. – Closed Limelike Curves Nov 01 '21 at 01:39
  • Bad models don't often have Gaussian residuals. Bad models do not tend to have the same distributions as good models having the same number of parameters, which is one reason among many why I said that comparison of non-nested models is problematic. You also said something that I did not understand, that is, that weighted least squares maps to AIC. However, if one is not minimizing the logarithm of the dependent variable(s), how does that map to maximum likelihood to then map to AIC? – Carl Nov 01 '21 at 14:03
  • For example, I use Nelder-Mead for non-convex problems quite frequently, and the more typical minimization algorithms do not work. Maximum likelihood would not work as a tool for finding a minimum in the non-convex case, or at least I have no idea how to do that. So what use is AIC in that context? It is not an option that I have seen for Nelder-Mead. – Carl Nov 01 '21 at 14:18
  • First -- Gaussian residuals aren't necessary for AIC. You just need the likelihood to be approximately Gaussian, which it will be for large sample sizes regardless of the model. Second, WLS is just maximum likelihood estimation where the variable is a Gaussian with a variance that is not constant, so you can apply AIC to determine the quality of a fit there (just like you can apply AIC to OLS). – Closed Limelike Curves Dec 04 '21 at 03:37
  • Nelder-Mead can be used for least squares, but my use of it is for non-linear models. For the data I am currently processing, which is non-convex, maximum likelihood (ML) regression is divergent; it does not, and cannot, converge. Moreover, no gradient-based method that I have tried converges. Nelder-Mead, a general search technique, does converge, unlike gradient-based methods. Without a gradient, AIC cannot be calculated directly, and calculating one indirectly would be unmotivated. Residuals are not asymptotically normal when all models are wrong. – Carl Dec 04 '21 at 05:32
  • Residuals don't have to be asymptotically normal; the only thing that needs to be asymptotically normal is the likelihood. The likelihood is asymptotically normal pretty much 100% of the time, even when the residuals aren't normal and the model is wrong -- you need a truly pathological case, like trying to fit a normal distribution to a Cauchy-distributed variable, for this to fail. – Closed Limelike Curves Jan 11 '22 at 21:45
  • Which optimization method you use (Nelder-Mead or anything else), or the availability of gradients, is unrelated to whether you can use AIC. AIC is just the log-likelihood of the data evaluated at your estimate, minus the number of parameters. – Closed Limelike Curves Jan 11 '22 at 21:49
  • Log-likelihood is applicable to random-variate data. However, if the x-axis data is not random, and is censored, as would occur for drawing blood samples at predetermined times, I do not see how log-likelihood then relates to a likelihood. Moreover, if the loss function for a nonlinear model is, for example, logarithmic for one regression and something else, e.g., OLS, for another, then one would have to transform the AICc or similar results to be on the same scale prior to comparison between results; moreover, none of that solves the overfitting problem. – Carl Jan 11 '22 at 22:56
  • If you're fitting via least-squares, that's equivalent to choosing a Gaussian likelihood. – Closed Limelike Curves Jan 20 '22 at 22:57
  • In the linear case, perhaps. If you think this applies generally, then show me, please. However, an assumption is not a proof and in my work, at least, I have noted that only when the model I use is stellar do I get quasi-Gaussian residuals, but when the model is less perfect, I have gotten residuals that are distinctly non-normal like a three parameter Weibull distribution. How one can compare two differently distributed sets of residuals with a single parameter I wouldn't know. One is probably better off comparing the residual distributions' parameters themselves, no? – Carl Jan 21 '22 at 03:57
  • I think you're confusing the residuals with the likelihood function. In general, these two aren't related. The residuals don't need to be Gaussian for the likelihood to be approximately Gaussian. The reason is a theorem from Laplace, which guarantees that any function of the form exp(N * f(x)) is approximately Gaussian for large N, as long as f(x) satisfies mild regularity conditions. In statistics, f(x) is the log-likelihood, and N is the sample size: https://en.wikipedia.org/wiki/Laplace%27s_method In general this holds when your data have finite variance. – Closed Limelike Curves Jan 21 '22 at 17:59
  • Understood. However, not every regression is done with log-likelihood of random variates. I think that maximum likelihood would not apply, for example, to censored time-samples using sample-times chosen non-randomly by an experimenter. Moreover, the dependent variable error structure may be other than log distributed such that maximum likelihood regression may be biased and inefficient, or at least I seem to have observed that much. – Carl Jan 22 '22 at 18:26
  • Oh absolutely, maximum likelihood can be a very biased and inefficient estimator, especially when using large numbers of parameters. That being said, you don't need to be actually using the maximum-likelihood estimator -- AIC is often framed that way, but the general principle just requires finding the log-probability of observing your data, as given by some model, then subtracting a correction term. – Closed Limelike Curves Jan 22 '22 at 19:02
  • You can definitely do censored samples and non-random sample-times using likelihood functions -- in fact, it's much easier IME to deal with these when you can use the likelihood principle to screen off a lot of decisions that don't really matter, which is why I usually work with these kinds of things in a Bayesian framework. – Closed Limelike Curves Jan 22 '22 at 19:04
  • I think there is a problem with AIC from the theory of minimizing self-entropy whereas a perfect model despite noise would have optimized, not minimized, self-entropy. AICc and BIC would not correct for that any more than adjusted R-squared would, especially when the number of parameters approaches the number of samples, and the difference between minimizing and optimizing becomes obvious. – Carl Jan 22 '22 at 21:10