Set-up
I have a chemometric dataset with a hierarchical structure: 18 patients, with ~250 spectra per patient. Nine of the patients have a disease and nine do not, and I want to compare several binary classifiers on this dataset.
For hyper-parameter optimisation I am performing leave-two-patients-out cross-validation, where each fold holds out one patient with and one without the disease. I chose this because it seemed sensible to have both positive and negative cases in every validation set. I'm simply using accuracy to compare performance.
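Roughly, the folds are built like this (a minimal sketch, not classifier-specific; X, y and groups are placeholder names for the spectra, the per-spectrum labels and the patient IDs):

```python
import numpy as np
from itertools import product

def leave_two_patients_out(y, groups):
    """Yield (train_idx, val_idx) index pairs. Each validation fold contains
    every spectrum of one diseased and one disease-free patient."""
    y, groups = np.asarray(y), np.asarray(groups)
    patients = np.unique(groups)
    # each patient's label, taken from their first spectrum (labels are per patient)
    label = {p: y[groups == p][0] for p in patients}
    pos = [p for p in patients if label[p] == 1]
    neg = [p for p in patients if label[p] == 0]
    # iterate over every diseased/healthy pairing (9 x 9 = 81 folds here;
    # pairing patients one-to-one into 9 folds would also work)
    for p1, p0 in product(pos, neg):
        val = np.isin(groups, [p1, p0])
        yield np.where(~val)[0], np.where(val)[0]
```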
The problem
A curious thing occurs with this set-up: there is very high variance in a classifier's accuracy depending on which two patients are held out to assess it. The validation-set accuracy ranges from 92% down to as low as 12% in one instance, and even setting aside these two extremes there is considerable variation in the results.
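Concretely, the spread comes from a loop like the following (a hypothetical snippet; clf stands for whichever classifier is being evaluated, and leave_two_patients_out is the helper sketched above):

```python
from sklearn.base import clone

fold_accs = []
for tr, va in leave_two_patients_out(y, groups):
    model = clone(clf)              # fresh, unfitted copy for each fold
    model.fit(X[tr], y[tr])
    fold_accs.append(model.score(X[va], y[va]))

# these per-fold accuracies are what vary so much (roughly 0.12 up to 0.92)
```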
My interpretations
1.) Perhaps there is naturally a significant degree of inter-patient variation in such samples, making it difficult to generalise performance to new patients. The sample size (18 patients) may simply be too small to capture this natural variation, i.e. it is not representative of the population.
2.) Perhaps there is a labelling error (which would especially explain the 12% case). I will ask, but it might be some time before an expert can take a look, so I want to be sure it's worth the effort first.
3.) I have heard of model instability, but I'm not sure I understand it. Is this a manifestation of it? The term implies to me that the instability (i.e. the high variance) is caused by the model, but wouldn't the instability here be a feature of the data rather than of the models, or at least an interaction of the models and the data?
The classifiers I want to compare are PCA-LDA, SVM and a CNN. I plan to wrap the whole thing in an outer CV loop to assess which performs best, so the entire process will be nested CV (a rough sketch is given after point 4 below). The instability is present, and roughly the same, for all of these models.
4.) I've done something silly.
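Here is the sketch referred to above: an outline of the nested CV I have in mind, with an SVM pipeline standing in for whichever model is being compared. X, y, groups and param_grid are placeholders (assumed NumPy arrays / a list of candidate settings), and leave_two_patients_out is the helper from the set-up section.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def nested_cv_accuracies(X, y, groups, param_grid):
    """Outer loop estimates generalisation accuracy; inner loop picks
    hyper-parameters. Both loops split by patient, so no patient's spectra
    appear in both the training and validation side of the same split."""
    outer_accs = []
    for out_tr, out_va in leave_two_patients_out(y, groups):
        # inner CV: choose hyper-parameters using the outer training patients only
        best_params, best_mean = None, -np.inf
        for params in param_grid:
            inner = [
                make_pipeline(StandardScaler(), SVC(**params))
                .fit(X[out_tr][tr], y[out_tr][tr])
                .score(X[out_tr][va], y[out_tr][va])
                for tr, va in leave_two_patients_out(y[out_tr], groups[out_tr])
            ]
            if np.mean(inner) > best_mean:
                best_params, best_mean = params, np.mean(inner)
        # refit on all outer-training patients with the chosen settings,
        # then score on the two held-out patients
        model = make_pipeline(StandardScaler(), SVC(**best_params))
        model.fit(X[out_tr], y[out_tr])
        outer_accs.append(model.score(X[out_va], y[out_va]))
    return outer_accs

# e.g. nested_cv_accuracies(X, y, groups, [{"C": 1.0}, {"C": 10.0}, {"C": 100.0}])
```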