Suppose you are given a medium-sized dataset and you ran k-fold cross-validation once. You notice that the scores on the individual folds differ noticeably. Which validation strategy is the most practical choice here?
I thought about sticking with k-fold: if the scores differ from fold to fold, I can probably switch to repeated k-fold, as I've heard it's a good fit for medium-sized datasets. I based that on this passage:
A noisy estimate of model performance can be frustrating as it may not be clear which result should be used to compare and select a final model to address the problem.
One solution to reduce the noise in the estimated model performance is to increase the k-value. This will reduce the bias in the model’s estimated performance, although it will increase the variance: e.g. tie the result more to the specific dataset used in the evaluation.
An alternate approach is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats. This approach is generally referred to as repeated k-fold cross-validation.
(Jason Brownlee, MachineLearningMastery.com)
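To illustrate the noise the quote is talking about, here is a small sketch of my own (same synthetic dataset as the example below, with 10 folds assumed): a single k-fold evaluation is run with three different shuffles, and the mean accuracy shifts from run to run, which is exactly what repeated k-fold averages out.

# single 10-fold CV run with three different shuffles:
# the mean score shifts between runs, illustrating the noise
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
model = LogisticRegression()
for seed in (1, 2, 3):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('run %d: mean accuracy %.3f' % (seed, mean(scores)))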
But I was wondering if leave-one-out (LOO) or a plain holdout split would have been better (a comparison sketch follows the example below).
Here is the minimal reproducible code I can compare to:
# evaluate a logistic regression model using repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
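For comparison, here is a minimal sketch of my own that scores the same model with leave-one-out and with a single holdout split (the 70/30 ratio is my assumption, not taken from anywhere). LOO refits the model once per sample, so it gets slow on larger data, and a single holdout produces one noisy estimate rather than a distribution of scores.

# evaluate the same model with leave-one-out and a single holdout split
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
model = LogisticRegression()
# leave-one-out: one fit per sample, each scored on a single held-out example
loo_scores = cross_val_score(model, X, y, scoring='accuracy', cv=LeaveOneOut(), n_jobs=-1)
print('LOO accuracy: %.3f' % mean(loo_scores))
# holdout: a single 70/30 split (assumed ratio), giving one noisy estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('Holdout accuracy: %.3f' % model.fit(X_train, y_train).score(X_test, y_test))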