
I have a very small data set of 50 samples, and I am performing LOOCV to evaluate the performance of a simple logistic regression model. I want to know the confidence interval of my evaluation; is this possible for LOOCV? I have tried running the LOOCV experiment multiple times with different random seeds (including setting the random state in the scikit-learn implementation of logistic regression), but I always get the same results for each run, so the variance is zero.

My background is not statistics so any suggestions would be greatly appreciated!
Thanks.

UPDATE: Thanks very much for all the answers below! I have learned not just about this particular problem but about cross-validation in general. https://avehtari.github.io/modelselection/CV-FAQ.html is also a good source of information for learning about the issues around CV.

Blue482

3 Answers


I have tried running the loocv experiment multiple times with different random seeds [...] I always get the same results for each run so the variance is zero.

Of course you get the same results: for LOO, the random seed cannot change anything but the order in which the different surrogate models are evaluated. One run of LOO consists of n surrogate models that each use 1 case for testing and the remaining n - 1 cases for training. However many runs you do, the surrogate model that tests case i will in each run be trained with the same training set.
LOO is exhaustive in the sense that all possible models with n - 1 training cases are computed in the standard run.
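For concreteness, here is a minimal scikit-learn sketch of one LOO run (the data set is a made-up placeholder for the 50 samples from the question); the splits are exhaustive and deterministic, so re-running with a different seed cannot change them:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# placeholder data standing in for the 50-sample set from the question
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

# one LOO run: 50 surrogate models, each tested on exactly one held-out case
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut(), scoring="accuracy")

print(scores.mean())  # each score is 0 or 1; the mean is the LOO accuracy estimate
```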

With LOO, you cannot distinguish variance uncertainty due to model instability from variance uncertainty due to the tested cases. This is because you always test exactly one case with exactly one surrogate model - no surrogate model is tested with more than one case and no case is tested with more than one model.
I consider this a fundamental flaw in the Design of Experiments underlying LOO.


Proper confidence or credible intervals for figures of merit (e.g. generalization error) are in general somewhere between difficult and impossible to calculate:

Several sources of uncertainty contribute to the total uncertainty.

  1. Variance uncertainty due to the finite number of tested cases.
  • For some figures of merit such as classification accuracy, sensitivity, specificity (in general: proportions of tested cases), you can use a binomial distribution. Since the variance of a binomial distribution depends only on the proportion and the number of trials, you only need e.g. the observed number of correct cases and the number of tested cases to arrive at a confidence or credible interval.

    Not sure about Python modules, but e.g. the R package binom provides such calculations (various approximations are available, along with literature references); a Python sketch is given at the end of this answer. Any such interval assumes that all other sources of uncertainty are negligible (which can be a valid assumption in certain circumstances).

  • For other figures of merit, you can do error propagation from the residuals, or e.g. bootstrap the figure of merit from your individual predictions.

  2. Model instability, i.e. the variation of the true performance of your surrogate models. As I explained above, LOO conflates this with case-to-case variation (for a particular prediction that is far off, you cannot know whether the model is bad, the case is difficult, or both).
    Repeated cross validation of a variety that leaves out more than one case per fold (or many other resampling validation schemes) allows you to directly assess model (in)stability. See e.g. our paper Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 2008, 390, 1261-1271.

  3. In case you are interested in the performance of a model trained with this particular algorithm on a training set of the given size, rather than the model you obtain with this particular algorithm from the training data at hand, there is further uncertainty that you fundamentally cannot measure by resampling validation. See e.g. Bengio, Y. & Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105.

  4. We haven't even been considering bias so far.

One practically important scenario is that you have obtained a stable model (show that instability is negligible, so there is no need to care about 2.), and your application means that 3. does not apply. In that case, you can go ahead and compute your intervals according to 1.
This is fairly often the case for tasks where you train a model for production use and restrict model complexity to produce stable models.

Of course, you could also derive a confidence interval that covers variances 1 and 2.

An easier alternative that may serve as well would be to show these observed variations (1. and 2.) without claiming a confidence interval.
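To make item 1 concrete, here is a minimal Python sketch of such a binomial interval (statsmodels is an assumed stand-in for the R package binom mentioned above; the counts are made up for illustration):

```python
from statsmodels.stats.proportion import proportion_confint

# assumed example counts: 42 of the 50 LOO test cases were classified correctly
n_correct, n_tested = 42, 50

# Wilson score interval for the LOO accuracy; other approximations
# ('beta' = Clopper-Pearson, 'agresti_coull', ...) are available via `method`
lo, hi = proportion_confint(count=n_correct, nobs=n_tested, alpha=0.05, method="wilson")
print(f"accuracy = {n_correct / n_tested:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

As stated above, such an interval only accounts for source 1 (the finite number of tested cases) and is justified only if instability (2.) is negligible.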

cbeleites unhappy with SX

  • Such a great answer! Thank you very much! I have also just discovered this paper https://arxiv.org/abs/2008.10296 which isn't directly applicable to my case but sheds light on the problem nonetheless. Thanks again. – Blue482 Aug 25 '20 at 20:20
  • Can I please ask whether you'd prefer 10-fold CV (gives a wider CI) or 10-fold CV repeated 100 times (which gives a narrower CI) for evaluating a data set of 50 samples? Is the latter a valid approach for evaluating model instability? Thanks! – Blue482 Aug 27 '20 at 15:08
  • Repeated 10-fold CV is a valid approach to estimate model instability. The CI should be the same or *wider* than for 10-fold CV without repetitions: without repetitions you can only calculate a CI that does not cover the effect of model instability. Since you say you got a narrower CI for repeated CV: you can **not** calculate a CI for the 10x100 = 1000 folds of the repeated CV! You need to calculate a CI for variance (sample size) + variance (instability). Variance (sample size) should be the same whether you repeat or not. – cbeleites unhappy with SX Aug 27 '20 at 17:58
  • Thanks very much! You're right that I calculated a CI for the 10x100 = 1000 folds! But are you suggesting I should obtain a CI for each 10-fold CV, and then average all CIs in the end to obtain the final CI? – Blue482 Aug 27 '20 at 18:30
  • NO, it's not that easy. You need to separate both variances and then compute the total variance. They are additive if there's no correlation between both factors - which I think is OK for a start here (the more so, as violations of this assumption mean only that your CI is too wide). – cbeleites unhappy with SX Aug 27 '20 at 18:34
  • I have computed the variance due to sample size using the binomial proportion CI as you suggested, say (0.68, 0.95). I still don't know how to obtain the variance for model instability using repeated 10-fold CV, where I have N confidence intervals. Sorry for being a noob... – Blue482 Aug 27 '20 at 18:57
  • Kinda handwavy: compute the binomial variance -> var_test. Calculate the figure of merit for all repetitions, then calculate the variance across these. Each run aggregates (averages) k = 10 folds, so we expect the variance across runs to be 1/10 of the variance due to model instability => var_instability = k * variance across runs. Proceed with the normal approximation (make sure the normal approximation is OK for your binomial CI). Alternative: set up a bootstrap simulation. Does this help? – cbeleites unhappy with SX Aug 28 '20 at 00:25
  • Thanks. If I am not mistaken, each run here is a repetition, right? So 1) for each run I compute, say, the averaged accuracy score over the 10-fold CV; 2) compute the variance of these accuracy scores across, say, 100 runs; 3) k*variance = 10*variance = final variance?? I am not sure if this is what you meant, or should I compute the variance per run and then average the variances across runs?? – Blue482 Aug 28 '20 at 20:17
  • Yes, that's the variance due to model instability. – cbeleites unhappy with SX Aug 28 '20 at 21:07
  • Thanks very much! Just to confirm (sorry for nagging), you meant you prefer the former approach (i.e. the 3 steps) over the latter one, right? – Blue482 Aug 28 '20 at 22:05
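The recipe sketched in the last few comments can be written out as follows (a hedged sketch only: the data set is a made-up placeholder, and var_test / var_instability follow the handwavy decomposition above):

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# placeholder data standing in for the 50-sample set from the question
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

k, n_repeats = 10, 100
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n_repeats, random_state=1)
fold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                              cv=cv, scoring="accuracy")

# accuracy per repetition: each repetition aggregates (averages) k folds
run_scores = fold_scores.reshape(n_repeats, k).mean(axis=1)
p_hat = run_scores.mean()

# 1. variance due to the finite number of tested cases (binomial)
var_test = p_hat * (1 - p_hat) / len(y)

# 2. variance due to model instability: each run averages k folds, so the
#    run-to-run variance is expected to be 1/k of the instability variance
var_instability = k * run_scores.var(ddof=1)

# normal approximation covering both sources
half_width = norm.ppf(0.975) * np.sqrt(var_test + var_instability)
print(f"accuracy = {p_hat:.2f}, "
      f"approx. 95% CI = ({p_hat - half_width:.2f}, {p_hat + half_width:.2f})")
```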

One way would be to take the mean and standard deviation and apply the central limit theorem to justify the old mean ± 2 standard errors formula. Because the folds are very highly correlated, there may or may not be some objections to doing this. I think the best way is to actually bootstrap the entire process and then correct for optimism in the training error via the Efron-Gong bootstrap procedure. The procedure is explained here quite well in R, and could be translated to Python with a little effort.
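For illustration, here is a minimal Python sketch of the optimism-correction idea (accuracy is used as an assumed figure of merit and the data set is a made-up placeholder; the R procedure referenced above is more complete):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# placeholder data standing in for the 50-sample set from the question
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

def accuracy(model, X, y):
    return float(np.mean(model.predict(X) == y))

# apparent performance: fit and evaluate on the full data set
full_model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = accuracy(full_model, X, y)

# Efron-Gong optimism: refit on bootstrap resamples and compare the performance
# on the resample (training-like) with the performance on the original data
optimism = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:   # skip degenerate resamples with only one class
        continue
    boot_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    optimism.append(accuracy(boot_model, X[idx], y[idx]) - accuracy(boot_model, X, y))

corrected = apparent - np.mean(optimism)
print(f"apparent accuracy = {apparent:.2f}, optimism-corrected = {corrected:.2f}")
```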

Demetri Pananos

Suppose we understand the expression "the confidence interval of my evaluation" to mean "a credible range of values for each parameter I infer when I perform logistic regression on training data using the LOO cross validation technique." Then for a training set with $n$ values of $p$-dimensional predictors $x_i$ and responses $y_i, i=1\ldots n$, you will perform $n$ fits, one for each of the $n$ LOO subsets of $n-1$ points. For each LOO subset you will calculate $\hat{\beta}^{(i)}, i=1\ldots n$. From these you may find the 2.5th and the 97.5th percentile for each parameter $\beta_j$ and report that interval.

These numbers should agree with the frequentist confidence interval that you'd obtain by running the logistic_regression.fit(model, data, hessian=True) function (in whatever package you are using) with a flag set to return the Hessian matrix. The diagonal elements of the Hessian are roughly equivalent to the inverse variances of the elements of $\beta$. Then you'd report, e.g., $$ \Pr\bigg( |\beta_j-\hat{\beta}_j| > t^c_{\alpha/2}(\nu) \times \sqrt{\frac{1}{\text{Hessian}[j,j]}}\bigg) < \alpha $$ with $\alpha = 0.05$, where $\nu=n-p-1$ is the degrees of freedom and $p$ is the number of dimensions of the independent variable. But this would require only one logistic regression calculation and no LOO, which seems not to be your interest.
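A minimal sketch of the percentile idea from the first paragraph (scikit-learn used as an assumed stand-in, with made-up placeholder data; note that its LogisticRegression applies L2 regularization by default):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# placeholder data standing in for the 50-sample set from the question
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

# fit one logistic regression per LOO subset and collect the coefficient vectors
betas = []
for train_idx, _ in LeaveOneOut().split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    betas.append(model.coef_.ravel())
betas = np.array(betas)  # shape (n, p): one beta^(i) per left-out case i

# 2.5th and 97.5th percentiles of each coefficient across the n LOO fits
lower, upper = np.percentile(betas, [2.5, 97.5], axis=0)
for j, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"beta_{j}: ({lo:.3f}, {hi:.3f})")
```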

Peter Leopold