> I have tried running the LOOCV experiment multiple times with different random seeds [...] I always get the same results for each run so the variance is zero.
Of course you get the same results: for LOO, the random seed cannot change anything but the order in which the different surrogate models are evaluated. One run of LOO consists of n surrogate models that each use 1 case for testing and the remaining n - 1 cases for training. However many runs you do, the surrogate model that tests case i is trained on the same n - 1 cases every time.
LOO is exhaustive in the sense that all possible surrogate models with n - 1 training cases are already computed in a single, standard run.
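To see this concretely, here is a minimal sketch (assuming Python with scikit-learn, which the question seems to be about; the toy data is a placeholder) that simply lists the LOO splits. There is no random component anywhere for a seed to influence.

```python
# Minimal sketch: leave-one-out splits are fully determined by the data.
# The surrogate model that tests case i is always trained on the same
# n - 1 remaining cases; no random seed enters anywhere.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(20).reshape(10, 2)   # 10 toy cases with 2 features (placeholder)

for train_idx, test_idx in LeaveOneOut().split(X):
    # test_idx holds exactly one case, train_idx the remaining n - 1 cases.
    # LeaveOneOut has no random_state parameter: repeating the procedure
    # reproduces identical surrogate models.
    print("test:", test_idx, "train:", train_idx)
```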
With LOO, you cannot distinguish variance uncertainty due to model instability from variance uncertainty due to the tested cases. This is because you always test exactly one case with exactly one surrogate model: no surrogate model is tested with more than one case, and no case is tested with more than one model.
I consider this a fundamental flaw in the Design of Experiments underlying LOO.
Calculating proper confidence or credible intervals for figures of merit (e.g. generalization error) is in general somewhere between difficult and impossible:
Several sources of uncertainty contribute to the total uncertainty.
1. Variance uncertainty due to the finite number of tested cases.
   For some figures of merit such as classification accuracy, sensitivity, specificity (in general: proportions of tested cases), you can use a binomial distribution. Since the variance of a binomial distribution depends only on the proportion and the number of trials, you need only e.g. the observed number of correct cases and the number of tested cases to arrive at a confidence or credible interval.

   I'm not sure about Python modules, but e.g. the R package binom provides such calculations (various approximations are available, with literature references); a Python sketch follows below this list. Any such interval assumes that all other sources of uncertainty are negligible (which can be a valid assumption in certain circumstances).

   For other figures of merit, you can do error propagation from the residuals, or e.g. bootstrap the figure of merit from your individual predictions (also sketched below the list).
2. Model instability, i.e. the variation of the true performance of your surrogate models. As I explained above, LOO conflates this with case-to-case variation: for a particular prediction that is far off, you cannot know whether the model is bad, the case is difficult, or both.

   Repeated cross validation of a variant that leaves out more than one case per fold (or many other resampling validation schemes) allows you to assess model (in)stability directly; a sketch follows below this list. See e.g. our paper Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 2008, 390, 1261-1271.
3. In case you are interested in the performance of a model trained with this particular algorithm on a training set of the given size, rather than in the particular model you obtain with this algorithm from the training data at hand, there is further uncertainty that you fundamentally cannot measure by resampling validation. See e.g. Bengio, Y. & Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105.
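For point 1, here is a sketch of what such intervals could look like in Python (as a stand-in for the R package binom I use proportion_confint from statsmodels, plus a plain percentile bootstrap; y_true and y_pred are placeholders for the pooled test predictions, e.g. from LOO):

```python
# Sketch for point 1: interval for a proportion-type figure of merit (accuracy)
# and a percentile-bootstrap interval for an arbitrary figure of merit.
# Both only cover the uncertainty due to the finite number of tested cases.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=80)                          # placeholder labels
y_pred = np.where(rng.random(80) < 0.85, y_true, 1 - y_true)  # placeholder predictions

# 1a) Binomial (here: Wilson) interval for accuracy: only the number of correct
#     predictions and the number of tested cases are needed.
n_correct = int((y_true == y_pred).sum())
n_tested = len(y_true)
low, high = proportion_confint(n_correct, n_tested, alpha=0.05, method="wilson")
print(f"accuracy {n_correct / n_tested:.3f}, 95 % Wilson interval [{low:.3f}, {high:.3f}]")

# 1b) Percentile bootstrap of the figure of merit from the individual predictions.
def accuracy(t, p):
    return np.mean(t == p)

boot = np.array([accuracy(y_true[idx], y_pred[idx])
                 for idx in rng.integers(0, n_tested, size=(2000, n_tested))])
print("bootstrap 95 % percentile interval:", np.percentile(boot, [2.5, 97.5]))
```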
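For point 2, here is a sketch of the kind of repeated cross validation meant above (scikit-learn's RepeatedKFold with a placeholder data set and estimator; this only shows the principle, not the exact procedure of the cited paper). Each case is now tested by several different surrogate models, so variation of the predictions for one and the same case across repetitions reflects model instability rather than case difficulty:

```python
# Sketch for point 2: repeated k-fold cross validation.  Each repetition tests
# every case exactly once, but with a surrogate model trained on a different
# subset of the data, so instability can be observed directly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=60, n_features=10, random_state=0)  # placeholder data

n_folds, n_repeats = 5, 20
rkf = RepeatedKFold(n_splits=n_folds, n_repeats=n_repeats, random_state=1)

# predictions[r, i] = prediction for case i in repetition r
predictions = np.empty((n_repeats, len(y)))
for split_no, (train_idx, test_idx) in enumerate(rkf.split(X)):
    rep = split_no // n_folds                    # n_folds splits per repetition
    model = LogisticRegression(max_iter=1000)    # placeholder estimator
    model.fit(X[train_idx], y[train_idx])
    predictions[rep, test_idx] = model.predict(X[test_idx])

# Spread of the per-repetition accuracy indicates model instability.
per_rep_acc = (predictions == y).mean(axis=1)
print("mean accuracy %.3f, sd across repetitions %.3f"
      % (per_rep_acc.mean(), per_rep_acc.std(ddof=1)))

# Fraction of cases whose predicted class is not the same in all repetitions:
unstable = (predictions.min(axis=0) != predictions.max(axis=0)).mean()
print("fraction of cases with varying predictions: %.2f" % unstable)
```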
We haven't even considered bias so far.
One practically important scenario is that you have obtained a stable model (you show that instability is negligible, so there is no need to care about 2.), and your application is such that 3. does not apply. In that case, you can go ahead and compute your intervals according to 1.
This is fairly often the case for tasks where you train a model for production use and restrict model complexity to produce stable models.
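As a rough sketch of what "show that instability is negligible" could look like (placeholder numbers; per_rep_acc stands for the per-repetition accuracies from a repeated cross validation as sketched above), one can compare the observed spread across repetitions with the spread expected from the finite test sample alone:

```python
# Rough check whether model instability (point 2) is negligible compared to the
# uncertainty due to the finite number of tested cases (point 1).
import numpy as np

per_rep_acc = np.array([0.83, 0.85, 0.85, 0.87, 0.83, 0.85, 0.87, 0.85])  # placeholder values
n_tested = 60

p = per_rep_acc.mean()
sd_finite_test = np.sqrt(p * (1 - p) / n_tested)   # binomial spread, point 1 alone
sd_instability = per_rep_acc.std(ddof=1)           # spread across repetitions, point 2

print(f"sd expected from finite test set: {sd_finite_test:.3f}")
print(f"sd observed across repetitions:   {sd_instability:.3f}")
# If the spread across repetitions is much smaller than the binomial spread,
# instability is negligible and an interval according to point 1 is reasonable.
```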
Of course, you could also derive a confidence interval that covers the variance contributions 1. and 2.

An easier alternative that may serve just as well is to show these observed variations (1. and 2.) without claiming a confidence interval.