I have trained and tested four different classifiers, and I would now like to compare them. The classifiers have accuracies of 95 %, 90 %, 81 %, and 75 %. I know that there is no unbiased estimator of the variance of $k$-fold cross validation accuracy, which is going to make things difficult.

In order to say more, we need to know at the very least the number of test cases (see the example below).
$k$-fold cross validation cannot measure the variance connected to the fact that your particular data set may not be fully representative of the problem at hand. In other words, the assumption that resampling is a good approximation to obtaining genuinely new samples breaks down here. Another part of the model variance, however, can be measured by iterated/repeated $k$-fold cross validation: the variance in the predictions caused by exchanging a few training cases (model instability).
The residual model variance that remains unknown matters if the question at hand is "Which classifier performs best for this type of application?", i.e. in scenarios where the winning classifier will afterwards be trained on a completely new sample of the same type.

If, instead, the scenario is that a model is to be trained from this particular data set and you then need an estimate of how well that model will predict unknown cases later on, this variance (which resampling validation can never capture) does not matter: the instability measurements should, in practice, tell you all you need to know.
All of this concerns variance uncertainty on the model side.
Note that the finite number of test cases also means that the testing itself is subject to variance. This testing variance is typically assumed to follow a binomial distribution for performance measures of the fraction-of-tested-cases type, i.e. overall accuracy, sensitivity, specificity, predictive values, etc.
You can account for this variance by constructing binomial confidence intervals. E.g. assuming you observe 17 correct out of 21 tests, the 95% confidence interval for the accuracy is roughly 60 - 93 % (point estimate: 81 %). Here, the actual number of independent (different) cases enters: iterating the cross validation does not change this number!
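For reference, a minimal sketch of such a calculation in Python, assuming statsmodels is available (the Wilson score interval is used here; an exact Clopper-Pearson interval via `method="beta"` gives slightly different bounds):

```python
from statsmodels.stats.proportion import proportion_confint

n_correct, n_test = 17, 21   # the observed test results from the example above

# Wilson score interval for the true accuracy at 95% confidence.
ci_low, ci_high = proportion_confint(n_correct, n_test, alpha=0.05, method="wilson")

print(f"point estimate: {n_correct / n_test:.2f}")    # ~0.81
print(f"95% CI: {ci_low:.2f} - {ci_high:.2f}")        # roughly 0.60 - 0.92
```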
Practical considerations:
If the confidence intervals calculated from the finite number of test cases alone overlap, or if a paired test based on those test cases (look e.g. into McNemar's test) is not significant, you cannot claim a difference between the classifiers. In that case, you don't need to worry about the not-yet-accounted-for variance in the models, as additional variance can never lower the total uncertainty.
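When per-case predictions of two classifiers on the same test cases are available, such a paired test can be run e.g. via statsmodels; the 2×2 table below is purely hypothetical and only illustrates the required input format:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired results on the same 21 test cases:
# rows = classifier A correct / wrong, columns = classifier B correct / wrong.
table = np.array([[15, 2],
                  [4, 0]])

# Exact (binomial) McNemar test; only the discordant off-diagonal cells matter.
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.3f}")
```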
The other way round: do sample size (or power) calculations, e.g. with the observed accuracies. If these calculations yield sample sizes far above what you can realistically obtain, report that a data-driven model comparison is not possible and instead choose the final classifier based on your knowledge of the type of classifier, the application, and the data at hand.
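A rough sketch of such a power calculation for two of the observed accuracies, again assuming statsmodels; this treats the two classifiers as independent (unpaired), which is conservative, and the numbers are only illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# How many test cases per classifier would be needed to distinguish
# accuracies of 0.95 and 0.90 at 5% significance with 80% power?
effect_size = proportion_effectsize(0.95, 0.90)   # Cohen's h
n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=0.05, power=0.80,
                                           alternative="two-sided")
print(f"required test cases per classifier: about {n_per_group:.0f}")  # several hundred
```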
Use iterated/repeated $k$-fold cross validation to check the stability of the predictions with respect to slight changes in the training set. If you detect instability, in practice you should probably reconsider the way you set up your classifier rather than trying to include this variance estimate in the comparison.
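A minimal sketch of such a stability check with scikit-learn; the data set and classifier are stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)     # stand-in data set
clf = LogisticRegression(max_iter=5000)        # stand-in classifier

# 5-fold CV repeated 20 times with different random splits: the spread of the
# per-fold accuracies indicates how (un)stable the predictions are when a few
# training cases are exchanged.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}, spread (std): {scores.std():.3f}")
```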
Decide whether your comparison needs to include the undetectable variance that arises because resampling is not the same as drawing new samples (see the scenarios above). If not, you're fine. If yes, report that your results do not include this possible additional variance (you cannot do more anyway).
If you are training all classifiers in question, you can and should set up a paired comparison: for each particular split, train and test all classifiers on exactly the same cases. This allows far more powerful statistical tests for the comparison.
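A sketch of such a paired setup with scikit-learn (data set and classifiers are again placeholders); the per-fold accuracies, or the pooled per-case predictions, can then be compared with a paired test:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)     # stand-in data set
classifiers = {                                # stand-in classifiers
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=0),
}

# One fixed set of splits, reused for every classifier, so that all models are
# trained and tested on exactly the same cases (paired design).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc = {name: [] for name in classifiers}

for train_idx, test_idx in cv.split(X, y):
    for name, clf in classifiers.items():
        clf.fit(X[train_idx], y[train_idx])
        fold_acc[name].append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

for name, accs in fold_acc.items():
    print(f"{name}: per-fold accuracies {np.round(accs, 3)}")
```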