
Asymptotically, minimizing the AIC is equivalent to minimizing the leave-one-out cross-validation MSE for cross-sectional data [1]. Given that we have the AIC, why would one use the practice of dividing the data into training, validation, and test sets to measure the predictive properties of models? What specifically are the benefits of this practice?

I can think of one reason: out-of-sample analysis is useful if one wants to assess the models' predictive performance. But although the AIC is not a measure of forecast accuracy, one usually has a good idea of whether a model is reaching its maximum potential (for the given data) in terms of how well one will be able to predict.
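
For concreteness, here is a minimal sketch (simulated data; statsmodels and scikit-learn assumed available, with illustrative variable names) that ranks a few nested OLS models by both AIC and leave-one-out CV MSE; with a reasonably large sample the two orderings typically agree:

```python
# Minimal sketch: compare AIC ranking with leave-one-out CV MSE ranking
# for a few nested OLS models on simulated cross-sectional data.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # X[:, 2] is pure noise

for k in (1, 2, 3):                      # candidate models use the first k predictors
    Xk = X[:, :k]
    aic = sm.OLS(y, sm.add_constant(Xk)).fit().aic
    loo_mse = -cross_val_score(LinearRegression(), Xk, y,
                               scoring="neg_mean_squared_error",
                               cv=LeaveOneOut()).mean()
    print(f"k={k}  AIC={aic:8.1f}  LOO-CV MSE={loo_mse:6.3f}")
```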

Marcelo Ventura
Erosennin
  • An excerpt from [sklearn's docs](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html): *Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).* – sascha May 27 '16 at 11:48
  • I do not actually think that AIC assumes a correct model (http://stats.stackexchange.com/questions/205222/does-bic-try-to-find-a-true-model). Regarding sample size and AIC being an asymptotic result: you would *never* divide your data into three parts when you have little data. So small sample size is problematic for *both* out-of-sample analysis and AIC – Erosennin May 27 '16 at 11:58
  • I'm not an expert, but these [slides](http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf) indicate that AIC's theoretical properties depend on "how good" the model is. There is also an alternative information criterion which seems to be more robust in this respect (at some cost). – sascha May 27 '16 at 14:34
  • Yes, I know what AIC *is* and *means*. I'm not getting how these links you provide are helping with the question posed. – Erosennin May 27 '16 at 14:47
  • Well, both links say that AIC-based model selection loses some theoretical properties (performs worse) when the model is "not good", which is not the case for leave-one-out CV / k-fold CV. How does this not target your question (irrespective of your trust in this statement)? – sascha May 27 '16 at 15:08
  • As I said, I believe there is an erroneous statement in your first link regarding assuming the model is "correct". The best model is the model with the least KL information loss, and a "not good" model is a model with more KL information loss. Among the properties of AIC I can't recall anything about it "breaking down" when handling a "not good" model. "Badly conditioned" does not mean a "not good" model. As far as I can see, your second link does not shed any new light on this either. Could you specify what exactly you mean by "lose some theoretical properties ... ctd – Erosennin May 27 '16 at 16:50
  • ...ctd ... when the model is "not good""? To me, this does not make sense. It does not target my question because AIC does not work any *worse* when handling bad models (defined in terms of KL information loss). – Erosennin May 27 '16 at 16:53
  • @sascha has a point there: for AIC to approximate expected KL info. loss *well* one of the models has to be fairly good. I don't think anyone advocates using AIC to compare bad models to see which is less bad. – Scortchi - Reinstate Monica May 29 '16 at 18:09
  • @Scortchi: In the derivation of the AIC, or in any treatment I've read on AIC, I've yet to see this mentioned. Could you please show me where this is stated? (I see no such thing in the two links provided) – Erosennin May 29 '16 at 18:18
  • $\operatorname{tr}(J(\theta_0)(I(\theta_0))^{-1}) \approx k$ in slide 10 that @sascha linked to. (I was just looking on our site - we seem to have a lot of assertions about AIC, & references containing yet more assertions; but little beyond. From memory, Pawitan, *In All Likelihood*, & Burnham & Anderson, *Model Selection*, give derivations.) – Scortchi - Reinstate Monica May 29 '16 at 18:22
  • Ok, I skipped the TIC part and missed that bit. You are absolutely right. Apologies to you @sascha, and thank you for enlightening me :) Yes, I just had a look in Burnham & Anderson myself. Great resource! – Erosennin May 29 '16 at 18:41
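
For reference, the trace term in the last few comments comes from Takeuchi's information criterion (TIC); the form below follows the usual presentation (e.g. Burnham & Anderson), with $J(\theta_0)$ the expected outer product of the score and $I(\theta_0)$ the expected negative Hessian, both taken under the data-generating distribution:

$$\mathrm{TIC} = -2\log L(\hat{\theta}) + 2\,\operatorname{tr}\!\left(J(\theta_0)\,I(\theta_0)^{-1}\right).$$

When the model is (approximately) correctly specified, $J(\theta_0) \approx I(\theta_0)$, so the trace is approximately $k$ and the penalty reduces to AIC's $2k$; that approximation is what requires one of the candidate models to be fairly good.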

1 Answer


In practice, I always use cross-validation or a simple train–test split rather than AIC (or BIC). I'm not too familiar with the theory behind AIC, but two chief concerns lead me to prefer more direct estimates of predictive accuracy:

  1. The number itself doesn't tell you much about how accurate a model is. AIC can provide evidence as to which of several models is the most accurate, but it doesn't tell you how accurate the model is in units of the DV. I'm almost always interested in concrete accuracy estimates of this kind, because they tell me how useful a model is in absolute terms, and also how much more accurate it is than a comparison model.

  2. AIC, like BIC, requires for each model a parameter count or some other value that measures the model's complexity. It isn't clear what you should do for this in the case of less traditional predictive methods like nearest-neighbor classification, random forests, or the wacky new ensemble method you scribbled onto a cocktail napkin midway through last month's bender. By contrast, accuracy estimates can be produced for any predictive model, and in the same way (see the sketch below).
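
To illustrate both points, here is a minimal sketch (simulated data; scikit-learn assumed, and the particular models are illustrative) that computes the same cross-validated RMSE, in the units of the DV, for a trivial mean-only baseline, a linear model, and a random forest; the recipe is identical for all three, with no parameter count required:

```python
# Minimal sketch: one cross-validated accuracy estimate (RMSE, in DV units)
# applied identically to a trivial baseline, OLS, and a random forest.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.5, size=n)

models = {
    "trivial (mean only)": DummyRegressor(strategy="mean"),
    "linear regression":   LinearRegression(),
    "random forest":       RandomForestRegressor(n_estimators=200, random_state=0),
}
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{name:22s} CV RMSE = {np.sqrt(-neg_mse.mean()):.3f}")
```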

Kodiologist
  • +1 Great! #2 is a great argument! #1 addresses what I write about AIC not being a measure of forecast accuracy, great point! Can I ask how you compare "how much more accurate it is than a comparison model"? I recently thought about this when comparing two models' MSE. The MSE of Model 1 and Model 2 was 10 and 20, respectively. How do I interpret *how much* more accurate Model 1 is? I'm thinking it cannot be as simple as 20/10, because the comparison must/should take the scale of the DV into account? – Erosennin May 29 '16 at 16:45
  • I just look at both of the respective accuracy figures (MSE or whatever), rather than trying to make a comparison score. Also, it always helps to have an accuracy score for a trivial model (i.e., a model which uses no predictors) if that wasn't already one of the models you were comparing. – Kodiologist May 29 '16 at 16:49
  • (+1) There's a cottage industry in inventing effective AICs, quasi-AICs, & the like for situations that aren't maximum-likelihood estimation with a fixed no. of parameters. – Scortchi - Reinstate Monica May 29 '16 at 17:09
  • @Kodiologist: I think a comparison score would be very interesting. That way we could compare models built on different data sets, e.g. evaluate the performance of old models vs. new models when new data becomes available. – Erosennin May 31 '16 at 11:02
  • With respect to 2., there's a relatively easy way to get the degrees of freedom of the model (though in some cases it may be moderately time-consuming to compute, in many common situations there's a shortcut), which is $k=\sum_i \frac{\partial \hat{y}_i}{\partial y_i}$; in a quite literal, direct sense this measures the model's degrees of freedom to approximate the data. See for example Ye's 1998 JASA article. StasK links to a full reference in [this](http://stats.stackexchange.com/questions/57027/what-does-degree-of-freedom-mean-in-neural-networks) answer for example. ... ctd – Glen_b Jun 07 '16 at 00:34
  • ctd... It corresponds to the usual notion of degrees of freedom in common cases. e.g. if you can write $\hat{y}=Hy$, such as with a linear smoother, then that d.f. is just the usual trace of $H$. The nice thing about this approach is that all you really need is the ability to produce a fitted value for each observation. – Glen_b Jun 07 '16 at 00:38
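
A minimal sketch of that quantity (plain NumPy; the data and model are illustrative): for OLS, a finite-difference estimate of $k=\sum_i \partial\hat{y}_i/\partial y_i$ recovers the trace of the hat matrix, and the same perturb-and-refit recipe only needs fitted values, so in principle it extends to other predictive models:

```python
# Minimal sketch of the degrees-of-freedom idea in the comments above:
# k = sum_i d(yhat_i)/d(y_i), estimated by perturbing each y_i in turn,
# compared with the exact trace of the hat matrix for OLS.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

def ols_fitted(X, y):
    """Fitted values from ordinary least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Exact d.f.: trace of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
print("trace of hat matrix:   ", np.trace(H))          # = p + 1 = 5

# Generalized d.f. by finite differences: perturb one response at a time
eps, gdf = 1e-5, 0.0
base = ols_fitted(X, y)
for i in range(n):
    y_pert = y.copy()
    y_pert[i] += eps
    gdf += (ols_fitted(X, y_pert)[i] - base[i]) / eps
print("finite-difference d.f.:", gdf)                   # ~ 5 as well
```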