
In a question elsewhere on this site, several answers mentioned that the AIC is equivalent to leave-one-out (LOO) cross-validation and that the BIC is equivalent to K-fold cross-validation. Is there a way to empirically demonstrate this in R, such that the techniques involved in LOO and K-fold are made clear and shown to be equivalent to the AIC and BIC values? Well-commented code would be helpful in this regard. In addition, in demonstrating the BIC, please use the lme4 package. See below for a sample dataset...

library(lme4) #for the BIC function

generate.data <- function(seed)
{
    set.seed(seed) #Set a seed so the results are consistent (I hope)
    a <- rnorm(60) #predictor
    b <- rnorm(60) #predictor
    c <- rnorm(60) #predictor
    y <- rnorm(60)*3.5+a+b #the outcome is really a function of predictor a and b but not predictor c
    data <- data.frame(y,a,b,c) 
    return(data)    
}

data <- generate.data(76)
good.model <- lm(y ~ a+b,data=data)
bad.model <- lm(y ~ a+b+c,data=data)
AIC(good.model)
BIC(logLik(good.model))
AIC(bad.model)
BIC(logLik(bad.model))
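
For reference, the kind of K-fold procedure I have in mind is sketched below. This is only an illustration of the technique, not a demonstration of the claimed equivalence; the helper kfold.cv and the choice of K = 10 are my own.

#Split the data into K folds at random, hold out each fold in turn,
#fit on the remaining folds, and accumulate the squared prediction
#error on the held-out fold.
kfold.cv <- function(data, formula, K = 10)
{
    n      <- nrow(data)
    folds  <- sample(rep(1:K, length.out = n)) #random fold assignment
    sq.err <- numeric(n)
    for (k in 1:K)
    {
        test <- which(folds == k)
        fit  <- lm(formula, data = data[-test,]) #train on the other K-1 folds
        sq.err[test] <- (data[test,"y"] - predict(fit, newdata = data[test,]))^2
    }
    sum(sq.err)
}

kfold.cv(data, y ~ a+b)   #good.model
kfold.cv(data, y ~ a+b+c) #bad.model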

Per earlier comments, below I have provided a list of seeds from 1 to 10000 for which AIC and BIC disagree. This was done by a simple search through those seeds, but if someone could provide a way to generate data that tends to produce divergent answers from these two information criteria, that may be particularly informative.

notable.seeds <- read.csv("http://student.ucr.edu/~rpier001/res.csv")$seed

As an aside, I thought about ordering these seeds by the extent to which the AIC and BIC disagree, which I've tried quantifying as the sum of the absolute differences of the AIC and BIC. For example,

AICDiff <- AIC(bad.model) - AIC(good.model) 
BICDiff <- BIC(logLik(bad.model)) - BIC(logLik(good.model))
disagreement <- sum(abs(c(AICDiff,BICDiff)))

where my disagreement metric only reasonably applies to notable seeds, i.e. those for which AIC and BIC actually select different models. For example,

are.diff <- sum(sign(c(AICDiff,BICDiff))) #0 when AICDiff and BICDiff have opposite signs
notable  <- are.diff == 0 & AICDiff != 0  #TRUE when the two criteria select different models

However, in cases where AIC and BIC disagreed, the calculated disagreement value was always the same (and is a function of sample size). Looking back at how AIC and BIC are calculated, I can see why this might be the case computationally, but I'm not sure why it would be the case conceptually. If someone could elucidate that issue as well, I'd appreciate it.
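
For concreteness, the computation I have in mind goes something like the sketch below (the names n, k.good, and k.bad are just mine; everything else comes from the sample code above).

#AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, so for the same fitted models
#AICDiff - BICDiff = (2 - log(n)) * (k.bad - k.good), which depends only on the
#sample size and the difference in parameter counts. When the two criteria
#disagree (opposite signs), sum(abs(c(AICDiff,BICDiff))) collapses to
#abs(AICDiff - BICDiff), i.e. this constant.
n      <- nrow(data)
k.good <- attr(logLik(good.model), "df") #4 parameters: intercept, a, b, residual SD
k.bad  <- attr(logLik(bad.model),  "df") #5 parameters: adds c
(log(n) - 2) * (k.bad - k.good) #the constant disagreement value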

russellpierce
  • +1 The code would be simple to write, but I'm still very interested in seeing a clear, illustrative dataset. –  Jul 25 '10 at 12:31
  • I'm not sure what all would need to be in a clear and illustrative dataset, but I've made an attempt to include a sample dataset. – russellpierce Jul 26 '10 at 05:50
  • So look: what you provided is an example of a useless set, because the BIC and AIC give the same results: 340 vs. 342 for AIC and 349 vs. 353 for BIC -- so good.model wins in both cases. The whole idea behind that convergence is that a given cross-validation scheme will select the same model as its corresponding IC. –  Jul 26 '10 at 07:53
  • I did a simple scan and, for instance, for seed 76 the ICs disagree. –  Jul 26 '10 at 09:45
  • Thanks mbq - I didn't understand what you meant/needed in terms of an illustrative dataset. Also, in general, in terms of empirical demonstration, I'd imagine that a single example from a single seed won't really do the trick. I was imagining something like showing that a metric from a cross-validation method is correlated with the corresponding calculated information criterion. That way the ICs don't necessarily need to give different answers in order to demonstrate the relation between the IC and the cross-validation method. – russellpierce Jul 26 '10 at 18:04
  • Wow, this is something that will be even harder to obtain, I'm afraid; my general point in the whole discussion is that the convergence of those theorems is too weak, so the difference may emerge from random fluctuations. (And that it is not working for machine learning, but I hope this is obvious.) –  Aug 03 '10 at 23:19
  • That is interesting. If the convergence is so weak, I wonder why people even brought it up in the other question about AIC/BIC, or how they figured out that the cross-validation and IC methods were actually convergent; I'm not really involved in machine learning, so the last bit that you hope is obvious isn't obvious to me - could you expand on that point? Besides, the entire thing may be moot, as I say below - a correlational approach to the problem doesn't look like it will work (at least not while sample sizes are constant). – russellpierce Aug 04 '10 at 00:05
  • mbq; Is the question in its current form unanswerable or just difficult? – russellpierce Aug 08 '10 at 23:40
  • Not empirical, but related to this topic discussing BIC/AIC and cross-validation [here](http://stats.stackexchange.com/questions/2352/when-are-shaos-results-on-leave-one-out-cross-validation-applicable) – probabilityislogic Dec 13 '11 at 15:10

1 Answer


In an attempt to partially answer my own question, I read Wikipedia's description of leave-one-out cross-validation:

involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data.

In R code, I suspect that would mean something like this...

Nobs  <- nrow(data) #number of observations
resid <- rep(NA, Nobs)
for (lcv in 1:Nobs)
{
    data.loo  <- data[-lcv,] #drop the data point that will be used for validation
    loo.model <- lm(y ~ a+b, data=data.loo) #fit the model without that data point
    #compare the held-out observation to the value the LOO model predicts for it, and store the residual
    resid[lcv] <- data[lcv,"y"] - (coef(loo.model)[1] + coef(loo.model)[2]*data[lcv,"a"] + coef(loo.model)[3]*data[lcv,"b"])
}

... is supposed to yield values in resid that are related to the AIC. In practice, the sum of squared residuals from the LOO loop detailed above is a good predictor of the AIC across the notable.seeds, r^2 = .9776. However, elsewhere a contributor suggested that LOO should be asymptotically equivalent to the AIC (at least for linear models), so I'm a little disappointed that the r^2 isn't closer to 1. Obviously this isn't really an answer - it's more additional code offered to encourage someone to provide a better one.
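
For completeness, the comparison across seeds was along these lines (a sketch; loo.rss is just my own wrapper around the loop above, and it assumes generate.data and notable.seeds from the question are available).

#For each notable seed: regenerate the data, compute the AIC of the y ~ a+b model,
#compute the LOO sum of squared residuals for the same model, then correlate the
#two quantities across seeds.
loo.rss <- function(data)
{
    n   <- nrow(data)
    res <- rep(NA, n)
    for (lcv in 1:n)
    {
        loo.model <- lm(y ~ a+b, data=data[-lcv,])
        res[lcv]  <- data[lcv,"y"] - predict(loo.model, newdata=data[lcv,])
    }
    sum(res^2)
}

results <- sapply(notable.seeds, function(s)
{
    d <- generate.data(s)
    c(aic = AIC(lm(y ~ a+b, data=d)), rss = loo.rss(d))
})
cor(results["aic",], results["rss",])^2 #compare with the r^2 reported above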

Addendum: Since, for a fixed sample size, AIC and BIC for these models differ only by a constant, the correlation of BIC with the squared residuals is the same as the correlation of AIC with the squared residuals, so the approach I took above appears to be fruitless.

russellpierce
  • note that this will be your accepted answer for the bounty (in case you do not choose an answer, the bounty automatically selects the answer with the most points) – robin girard Aug 09 '10 at 11:21
  • well - awarding the bounty to myself seems silly - but nobody else has submitted an answer. – russellpierce Aug 09 '10 at 21:01