Set-up
I have a chemometric dataset with a hierarchical structure: 18 patients, with ~250 spectra per patient. Nine of the patients have a disease and nine do not, and I want to compare several binary classifiers on this dataset.
For hyper-parameter optimisation I am performing leave-two-patients-out cross-validation, where each fold holds out one patient with and one without the disease. I chose this because it seemed sensible to have both positive and negative cases in every validation set. I'm simply using accuracy to compare performance.
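Roughly, the folds are built like this (a minimal sketch, not classifier-specific; X, y and groups are placeholder names for the spectra, the per-spectrum labels and the patient IDs):

```python
import numpy as np
from itertools import product

def leave_two_patients_out(y, groups):
    """Yield (train_idx, val_idx) index pairs. Each validation fold contains
    every spectrum of one diseased and one disease-free patient."""
    y, groups = np.asarray(y), np.asarray(groups)
    patients = np.unique(groups)
    # each patient's label, taken from their first spectrum (labels are per patient)
    label = {p: y[groups == p][0] for p in patients}
    pos = [p for p in patients if label[p] == 1]
    neg = [p for p in patients if label[p] == 0]
    # iterate over every diseased/healthy pairing (9 x 9 = 81 folds here;
    # pairing patients one-to-one into 9 folds would also work)
    for p1, p0 in product(pos, neg):
        val = np.isin(groups, [p1, p0])
        yield np.where(~val)[0], np.where(val)[0]
```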
The problem
A curious thing occurs with this set-up: there is very high variance in a classifier's accuracy depending on which two patients are held out to assess it. The validation-set accuracy ranges from 92% down to as low as 12% in one instance, and even setting aside these two extremes there is considerable variation in the results.
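Concretely, the spread comes from a loop like the following (a hypothetical snippet; clf stands for whichever classifier is being evaluated, and leave_two_patients_out is the helper sketched above):

```python
from sklearn.base import clone

fold_accs = []
for tr, va in leave_two_patients_out(y, groups):
    model = clone(clf)              # fresh, unfitted copy for each fold
    model.fit(X[tr], y[tr])
    fold_accs.append(model.score(X[va], y[va]))

# these per-fold accuracies are what vary so much (roughly 0.12 up to 0.92)
```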
My interpretations
1.) Perhaps there is naturally a significant degree of inter-patient variation in such samples, making it difficult to generalise performance to new patients. The sample size (18 patients) may simply be too small to capture this natural variation, i.e. it is not representative of the population.
2.) Perhaps there is a labelling error (which would especially explain the 12% case). I will ask, but it might be some time before an expert can take a look, so I want to be sure it's worth the effort first.
3.) I have heard of model instability, but I'm not sure I understand it. Is this a manifestation of it? The term implies to me that the instability (i.e. the high variance) is caused by the model, but wouldn't the instability here be a feature of the data rather than of the models, or at least an interaction of the models and the data?
The classifiers I want to compare are PCA-LDA, SVM and a CNN. I plan to wrap the whole thing in an outer CV loop to assess which performs best, so the entire process will be nested CV (a rough sketch is given after point 4 below). The instability is present, and roughly the same, for all of these models.
4.) I've done something silly.
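Here is the sketch referred to above: an outline of the nested CV I have in mind, with an SVM pipeline standing in for whichever model is being compared. X, y, groups and param_grid are placeholders (assumed NumPy arrays / a list of candidate settings), and leave_two_patients_out is the helper from the set-up section.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def nested_cv_accuracies(X, y, groups, param_grid):
    """Outer loop estimates generalisation accuracy; inner loop picks
    hyper-parameters. Both loops split by patient, so no patient's spectra
    appear in both the training and validation side of the same split."""
    outer_accs = []
    for out_tr, out_va in leave_two_patients_out(y, groups):
        # inner CV: choose hyper-parameters using the outer training patients only
        best_params, best_mean = None, -np.inf
        for params in param_grid:
            inner = [
                make_pipeline(StandardScaler(), SVC(**params))
                .fit(X[out_tr][tr], y[out_tr][tr])
                .score(X[out_tr][va], y[out_tr][va])
                for tr, va in leave_two_patients_out(y[out_tr], groups[out_tr])
            ]
            if np.mean(inner) > best_mean:
                best_params, best_mean = params, np.mean(inner)
        # refit on all outer-training patients with the chosen settings,
        # then score on the two held-out patients
        model = make_pipeline(StandardScaler(), SVC(**best_params))
        model.fit(X[out_tr], y[out_tr])
        outer_accs.append(model.score(X[out_va], y[out_va]))
    return outer_accs

# e.g. nested_cv_accuracies(X, y, groups, [{"C": 1.0}, {"C": 10.0}, {"C": 100.0}])
```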