
In the scikit-learn documentation, I found the following comments about AIC:

> Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom. The criteria are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).

My questions are:

  1. Why would AIC break when we have more features than samples?
  2. Why are AIC and BIC commonly used in forecasting models like ARIMA?
Richard Hardy
Shan Dou
    *assume the model is correct* does not belong there. – Richard Hardy May 10 '21 at 05:19
    Here is why information criteria may be preferred to cross validation in time series: ["AIC versus cross validation in time series: the small sample case"](https://stats.stackexchange.com/questions/139175/aic-versus-cross-validation-in-time-series-the-small-sample-case). – Richard Hardy May 10 '21 at 07:54
  • @RichardHardy AIC requires that model specification (the functional form) is correct. This is in fact what is fixed in TIC: https://www.ssc.wisc.edu/~bhansen/718/NonParametrics14.pdf – Cagdas Ozgenc Sep 26 '21 at 14:18
  • @CagdasOzgenc, as far as I remember this is not the case. In fact, I would say the lack of such an unrealistic requirement is one of the hallmarks of AIC. Perhaps Hansen is discussing a special case or a special use of AIC? – Richard Hardy Sep 26 '21 at 14:49
  • @RichardHardy Very few people truly understood the derivation in my opinion. The terms in “true risk” and “empirical risk” don’t really cancel out when the truth is not in the search space. See faculty.washington.edu/yenchic/19A_stat535/Lec7_model.pdf and https://ejwagenmakers.com/2003/elephant.pdf, pp. 582 – Cagdas Ozgenc Sep 26 '21 at 16:40
  • @CagdasOzgenc, I personally know a couple of people who certainly *did* understand the derivations. They have written [a book](https://www.cambridge.org/core/books/model-selection-and-model-averaging/E6F1EC77279D1223423BB64FC3A12C37) about the matter. If you can find the requirement in the book, I will believe it. I flipped through the pages just now (Chapters 2 and 4 are the relevant ones) but did not find it. I looked at Wagenmakers and only found this *The original derivation of the AIC assumed that the [DGP] is among the set of candidate models*, but it does not prove your point. – Richard Hardy Sep 26 '21 at 17:03
  • @CagdasOzgenc, I read Yen-Chi Chen, too. How exactly does he support your point? – Richard Hardy Sep 26 '21 at 17:12
  • @RichardHardy I cannot find the original lecture notes from Wasserman where it was more clearly explained. However, the bottom line is that in Chen’s notes the quadratic terms of the Taylor approximation don’t cancel each other unless the model attains the truth. This difference is equal to the quadratic adjustment in TIC (Takeuchi was a student of Akaike). A few authors tried to get an estimate of this adjustment, but it made things worse due to higher estimation errors, making TIC practically useless. – Cagdas Ozgenc Sep 26 '21 at 18:12
  • @RichardHardy https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3139945 This paper has the derivations in detail. The conclusion is at the top of page 12: the matrices I and J are not equal and do not end up with the trace of the identity matrix yielding the K in AIC when the search space doesn’t contain the truth. – Cagdas Ozgenc Sep 26 '21 at 18:33
  • @CagdasOzgenc, thank you, I appreciate your help and the references. If you happen to find Wasserman's lecture note, I will be interested in it, too. – Richard Hardy Sep 26 '21 at 18:33
  • @RichardHardy In addition to what I just posted above, there is a brand-new paper https://www.sciencedirect.com/science/article/pii/S0167715221000262#b11 that, as a separate topic, fixes AICc for random regressors (as opposed to fixed regressors), which is more relevant in economics. In general, the application of AIC to autoregressive models was broken from the beginning, and very few authors mentioned this. – Cagdas Ozgenc Sep 26 '21 at 18:42
  • @RichardHardy I think this is the lecture note from the person who wrote the book you linked. http://www.math.rug.nl/stat/models/files/claeskens.pdf They also support my point, slide number 32 – Cagdas Ozgenc Sep 27 '21 at 10:39
  • @CagdasOzgenc, I think slides 32-33 support your point about how TIC is superior to AIC but say nothing about efficiency. – Richard Hardy Sep 27 '21 at 10:45
  • @RichardHardy I don’t understand what you are saying. If the search space is not very close to the truth, TIC improves over AIC. I think what people are really trying to say is that if we use a rich enough model, then even if the exact form is not known, it will be approximated well enough not to make AIC any worse compared to TIC. However, this is still only true asymptotically. In small samples we know that the functional form matters, as presented in the MDL work. – Cagdas Ozgenc Sep 27 '21 at 10:55

3 Answers


First off, as Richard Hardy comments, information criteria do not assume we have the true model. Quite to the contrary. For instance, AIC estimates the Kullback-Leibler distance between the proposed model and the true data generating process (up to an offset), and picking the model with minimal AIC amounts to choosing the one with the smallest distance to the true DGP. See Burnham & Anderson (2002, Model selection and multi-model inference: a practical information-theoretic approach) or Burnham & Anderson (2004, Sociological Methods & Research) for an accessible treatment. They also go into the justification for BIC.
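
A minimal sketch of this idea (the data, the candidate feature sets, and the statsmodels usage below are purely illustrative, not part of the references above):

```python
# Minimal sketch: compare candidate regression models by AIC and keep
# the one with the smallest value (smallest estimated KL distance to
# the true DGP, up to the common offset). Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # DGP uses x0, x1 only

candidates = {"x0": [0], "x0+x1": [0, 1], "x0+x1+x2": [0, 1, 2]}
aics = {name: sm.OLS(y, sm.add_constant(X[:, cols])).fit().aic
        for name, cols in candidates.items()}

print(aics, "-> chosen:", min(aics, key=aics.get))  # typically "x0+x1"
```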

Information criteria break down with overparameterized models, but that's not really a problem of the ICs. Instead, it's that every overparameterized model that is not regularized breaks down, and that "normal" ICs don't work with regularized models. (I believe there are IC variants that apply to regularized models, but am not an expert in this.)
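
A toy illustration of that breakdown (made-up pure-noise data, assuming a Gaussian likelihood): with more features than observations, unregularized least squares interpolates the sample, the estimated residual variance collapses to zero, and the log-likelihood, and with it the AIC, degenerates.

```python
# Toy illustration: with more features than samples, unregularized
# least squares fits the data exactly and the AIC degenerates.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                                  # p > n: badly conditioned
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                         # pure noise, no signal

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm LS solution
sigma2 = np.mean((y - X @ beta) ** 2)          # ~0: the fit interpolates
print(f"estimated residual variance: {sigma2:.2e}")
# The Gaussian log-likelihood contains -n/2 * log(sigma2), which blows
# up as sigma2 -> 0, so AIC = 2k - 2 logL -> -inf for every saturated
# model: the criterion can no longer rank the candidates.
```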

ICs are used in forecasting model selection because of the above argument about distances to true DGPs. A related argument is that the AIC asymptotically estimates a monotone function of the prediction error (section 4.3.1 in Lütkepohl, 2005, New Introduction to Multiple Time Series Analysis, who also goes into other model selection criteria). Also, ICs are not the only tool used: some people prefer using holdout sets, but that means you need more data.
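
As a sketch of how this looks in a forecasting workflow (simulated data; the candidate order grid and statsmodels usage are arbitrary illustrations):

```python
# Sketch: select an ARMA order for forecasting by fitting each
# candidate once and keeping the order with the lowest AIC.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

rng = np.random.default_rng(2)
y = arma_generate_sample(ar=[1, -0.6], ma=[1, 0.3], nsample=300,
                         distrvs=rng.standard_normal)  # ARMA(1,1) DGP

best_order, best_aic = None, np.inf
for p in range(3):
    for q in range(3):
        res = ARIMA(y, order=(p, 0, q)).fit()
        if res.aic < best_aic:
            best_order, best_aic = (p, 0, q), res.aic
print("selected order:", best_order, "AIC:", round(best_aic, 1))
```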

Stephan Kolassa
  • Regarding *AIC asymptotically estimates a monotone function of the prediction error*, I have a related question: ["Equivalence of AIC and LOOCV under mismatched loss functions"](https://stats.stackexchange.com/questions/406430) (and more related questions linked in there). – Richard Hardy May 10 '21 at 08:24
  • I think the primary reason is that there are few if any good alternatives. The real issue is which information criterion you use: AIC, BIC, etc. There are differing opinions on that. Note that some use hold-out data sets (and MAPE, MSE, etc.) to choose which model is best (I do), but I assume that is not considered a statistical approach. – user54285 May 10 '21 at 23:21
  • Burnham is wrong. The fact that somebody wrote a book doesn’t make them right. AIC requires that model specification (the functional form) is correct. This is in fact what is fixed in TIC: ssc.wisc.edu/~bhansen/718/NonParametrics14.pdf – Cagdas Ozgenc Sep 26 '21 at 14:19

What alternatives do we have in model selection for prediction?

  • The main ones are cross validation and information criteria.

Why are the latter attractive in the time series setting?

  • Information criteria are less computationally intensive. You only need to fit the model once to calculate an information criterion, in contrast to most applications of cross validation. Computational efficiency is especially desirable in the time series setting, as many basic time series models (ARMA, GARCH and the like) tend to be rather computationally demanding (more so than, say, linear regression); the sketch at the end of this answer illustrates the cost difference.
  • Information criteria are also more effective in utilizing the data, as the model is estimated on the entire sample rather than just a training subset. This efficiency matters in small data sets* and especially in time series settings. In small data sets, we do not want to leave out too much data for testing, as then there is very little data left for training/estimation. We have leave-one-out cross validation (LOOCV), which leaves out only a single observation at a time in training/estimation, and it works well in a cross-sectional setting. However, it is often inapplicable in the time series setting due to the mutual dependence of the observations. Other types of validation that are applicable are much more data-costly. For more details, see "AIC versus cross validation in time series: the small sample case".

*Information criteria have an asymptotic justification, so their use is not unproblematic in small samples. Nevertheless, a more efficient use of the data is more desirable than a less efficient use. By using the entire sample for estimation you are closer to asymptotics than by using, say, 2/3 of the sample.
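
A rough sketch of the computational contrast mentioned above (simulated data; the AR(2) model and the sample sizes are illustrative choices): the information criterion comes from a single fit, while rolling-origin evaluation refits the model at every forecast origin.

```python
# Sketch of the cost contrast: AIC needs one fit on the full sample,
# while rolling-origin validation refits at every forecast origin.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima_process import arma_generate_sample

rng = np.random.default_rng(3)
y = arma_generate_sample(ar=[1, -0.5, 0.2], ma=[1], nsample=200,
                         distrvs=rng.standard_normal)  # AR(2) DGP

aic = AutoReg(y, lags=2).fit().aic            # one fit, whole sample

errors = []
for origin in range(150, 200):                # 50 refits, 1-step forecasts
    res = AutoReg(y[:origin], lags=2).fit()
    errors.append(y[origin] - res.forecast(1)[0])
print(f"AIC from 1 fit: {aic:.1f}; "
      f"rolling 1-step MSE from {len(errors)} fits: "
      f"{np.mean(np.square(errors)):.3f}")
```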

Richard Hardy

First of all, sorry, this was supposed to be a comment as opposed to an answer. The question has already been answered well. I just wanted to add that even though ICs aim at minimizing the distance to the true DGP, they might not always be able to do so. The true DGP is unknown, and there is no best way to identify the model closest to it. However, you can aid the ICs with the autocorrelation and partial autocorrelation functions. Just looking at these plots will give you an idea of what your model should look like in terms of lags. This will narrow down your pool of candidate models, and you can then select the one with the lowest IC. In my understanding, ICs look at how the models fit the distribution of the data but do not incorporate how the data is distributed over time. Incorporating auto-/partial autocorrelation plots helps to bridge the gap. Would love to be corrected if I am wrong.
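
A sketch of what this workflow might look like (simulated AR(1) data; the shortlist of orders is illustrative):

```python
# Sketch: use ACF/PACF plots to shortlist lag orders, then compare the
# shortlist by AIC.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

rng = np.random.default_rng(4)
y = arma_generate_sample(ar=[1, -0.7], ma=[1], nsample=300,
                         distrvs=rng.standard_normal)  # AR(1) DGP

plot_acf(y)    # for an AR(1), the ACF decays geometrically...
plot_pacf(y)   # ...while the PACF cuts off after lag 1
plt.show()

# Suppose the plots point to a low-order AR model: compare only those.
shortlist = [(1, 0, 0), (2, 0, 0)]
aics = {order: ARIMA(y, order=order).fit().aic for order in shortlist}
print("selected:", min(aics, key=aics.get), aics)
```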

Abbas
  • I think you are in fact wrong. ICs are based on the likelihood, and the likelihood accounts for all of the things you mention. – Richard Hardy Sep 26 '21 at 14:52