
I've taken a regression class and am now in a machine learning class. In regression, we talked about model selection using adjusted $R^2$ and AIC/BIC. In my machine learning class, we primarily select models using cross validation (or sample splitting) and the validation error.

I am not seeing the connection between the two approaches, or more specifically the divide between them. It seems like cross validation/sample splitting can work with any model, so why even bother with adjusted $R^2$ or AIC/BIC?

When would we want to use one over the other? And are there situations where AIC/BIC wouldn't work?

Thanks

confused

1 Answer


The question is quite broad, but I will give some starting points:

Why bother with AIC/BIC: using cross validation (CV) is (much) more computationally expensive than using AIC/BIC, except in some special cases such as leave-one-out cross validation (LOOCV) for linear regression, where it is computationally as cheap as AIC/BIC.
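To make the cost difference concrete, here is a minimal sketch (plain NumPy on synthetic data, not from the thread) of that special case: for a linear model fit by least squares, the LOOCV error follows from a single fit via the identity $e_i/(1-h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of the hat matrix, whereas naive LOOCV refits the model $n$ times.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# One OLS fit on the full sample
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.sum(X @ np.linalg.inv(X.T @ X) * X, axis=1)

# LOOCV mean squared error obtained from the single fit
loocv_cheap = np.mean((resid / (1 - h)) ** 2)

# Naive LOOCV: n separate fits
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
loocv_naive = np.mean(errs)

print(loocv_cheap, loocv_naive)  # agree up to floating-point error
```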

Situations where AIC/BIC would not work: AIC/BIC are only available for models estimated using maximum likelihood estimation (MLE), and this is a relatively small class of models in the context of machine learning.
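For reference, both criteria are functions of the maximized likelihood $\hat{L}$, which is why a likelihood-based model is needed in the first place:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L},$$

where $k$ is the number of estimated parameters and $n$ is the sample size; the model with the smallest value of the criterion is preferred.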

Connection between CV and AIC/BIC: under some assumptions, AIC is asymptotically equivalent to LOOCV, while BIC is asymptotically equivalent to $K$-fold CV with a specific fold size that depends on the sample size. So under these assumptions, you can save a lot of computation by replacing CV with AIC/BIC.
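As a rough illustration of that connection (a toy sketch on synthetic Gaussian data, not the asymptotic result itself), the following ranks nested linear models by AIC and by LOOCV; on a given finite sample the two criteria tend to, but need not, favour the same model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # only 2 predictors matter

for k in range(1, 6):  # candidate models: intercept + first (k - 1) predictors
    X = X_full[:, :k]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    rss = resid @ resid
    aic = n * np.log(rss / n) + 2 * (k + 1)  # Gaussian AIC up to a constant (+1 for the error variance)
    h = np.sum(X @ np.linalg.inv(X.T @ X) * X, axis=1)
    loocv = np.mean((resid / (1 - h)) ** 2)  # closed-form LOOCV, as in the earlier sketch
    print(f"params={k}  AIC={aic:8.1f}  LOOCV MSE={loocv:6.3f}")
```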

On $R^2_{adj.}$: According to "Justification for and optimality of $R^2_{adj.}$ as a model selection criterion", it is questionable whether $R^2_{adj.}$ can be regarded as an optimal model selection criterion. Personally, I would not use it when other alternatives like AIC, BIC or CV are available.
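For completeness, for a regression with $n$ observations and $k$ regressors (excluding the intercept), the quantity in question is

$$R^2_{adj.} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1},$$

which increases only when an added regressor lowers the unbiased estimate of the residual variance.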

Richard Hardy
  • LOOCV is computationally cheap? How is that not way more expensive than 5-fold CV? If I have 1000 observations, I have to fit 1000 regression models to run LOOCV. – Dave Feb 09 '20 at 12:49
  • @Dave, no, there is a neat workaround that does LOOCV for regression in just one fit, using the hat matrix (I think; or maybe another one) to obtain the errors for each of the $n$ folds. This only works for regression, though. There might be another thread describing it; otherwise you should be able to find it in a textbook. – Richard Hardy Feb 09 '20 at 12:55
  • Ah thanks, the second one is a good reason that I wasn't aware of. As my classes aren't taught by the same professor, there is no flow; we just suddenly switched from AIC to CV. I would imagine AIC might be more useful in situations where we have time series data. – confused Feb 10 '20 at 18:32
  • @confused, yes, in a time series setting, CV becomes conceptually more complicated, while AIC/BIC are still as simple as they are otherwise. – Richard Hardy Feb 11 '20 at 06:49