
I am working on a binary classification problem with relatively few instances (e.g. ~30 instances out of which ~7 are positives).

I have noticed that with 2-fold CV the average classification performance of the best-performing model is better than that of the best-performing model with 5-fold CV.

In fact,

  • The best performing model in 2-fold CV gets the following scores across the two folds:

    [0.82, 0.82] (avg. = 0.82).

  • That model is different from the best one I get with 5-fold CV, which yields the following AUC scores:

    [0.4, 1.0, 0.75, 0.75, 0.25] (avg. = 0.64).

This brings me to the following questions: which model should I use, and why would I ever get a better model when training on less data?
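For concreteness, here is a minimal sketch of the kind of comparison described above, assuming scikit-learn is available; the synthetic data from `make_classification` and the logistic-regression classifier are placeholders for the actual ~30-instance dataset and candidate models, not the setup used in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data: ~30 instances, roughly a quarter positive.
X, y = make_classification(n_samples=30, n_features=5, n_informative=3,
                           weights=[0.77, 0.23], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Compare per-fold and average AUC under 2-fold and 5-fold stratified CV.
for k in (2, 5):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(f"{k}-fold AUC per fold: {np.round(scores, 2)}  avg = {scores.mean():.2f}")
```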

Amelio Vazquez-Reina

1 Answer


5-fold cross-validation is generally better than 2-fold. Closer to the gold standard would be 100 repeats of 10-fold cross-validation, or the Efron-Gong optimism bootstrap. BUT your sample is not sufficient even to estimate a single parameter, much less to form predictions and run cross-validation. The 0.95 Wilson confidence interval for the probability of a positive, given 7/30 positives, is [0.12, 0.41].
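For reference, a minimal sketch of both calculations mentioned above (repeated stratified 10-fold CV and the Wilson interval), assuming scikit-learn and statsmodels are available; the synthetic data and classifier are placeholders, not the asker's actual model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from statsmodels.stats.proportion import proportion_confint

# 100 repeats of 10-fold CV, shown on a larger synthetic sample: with only
# 7 positives in 30 cases, 10-fold stratification would leave some folds
# without a single positive, and the AUC would be undefined there.
X, y = make_classification(n_samples=300, weights=[0.77, 0.23], random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"mean AUC over 1000 resamples: {scores.mean():.3f} (SD {scores.std():.3f})")

# 0.95 Wilson confidence interval for 7 positives out of 30.
lo, hi = proportion_confint(count=7, nobs=30, alpha=0.05, method="wilson")
print(f"Wilson 95% CI for P(positive): [{lo:.2f}, {hi:.2f}]")  # roughly [0.12, 0.41]
```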

Frank Harrell
  • Thanks Frank! This gives me a great perspective on the problem. What sizes of the 0.95 Wilson CI are typically considered "acceptable" or healthy? – Amelio Vazquez-Reina Jul 10 '13 at 14:26
  • Think about the precision with which you'd like to be able to estimate a binary logistic model containing only an intercept. n=96 could be thought of as a bare minimum number of observations to estimate just this one parameter. That will achieve a 0.95 margin of error of +/- 0.1 on an estimated probability in the worst case where the probability is 0.5 (a quick check of that arithmetic is sketched after these comments). – Frank Harrell Jul 10 '13 at 18:23
  • Thanks Frank. That makes sense. By the way, the Wilson CI that I am getting for 7 positives with a sample size of 30 is `[0.16, 0.31]` (not `[0.12, 0.41]`). Not sure why we get different numbers. Which implementation did you use? I am using the formula here: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval which was implemented [here](http://stackoverflow.com/questions/10029588/python-implementation-of-the-wilson-score-interval) (translated with `ups= # pos` and `downs = n - pos`) – Amelio Vazquez-Reina Jul 10 '13 at 20:24
  • In R, do `require(Hmisc); ?binconf`. Type `binconf` to see the code. I think it agrees with the reference you listed. The code is pretty easy to read. – Frank Harrell Jul 10 '13 at 20:57
  • Thanks Frank. By the way, about that gold standard, there is an interesting question about good vs bad practice here: http://stats.stackexchange.com/questions/64147/estimators-reporting-on-the-cv-sample-or-on-a-fully-separate-hold-out-set. It would be fantastic to see your opinion on it. – Amelio Vazquez-Reina Jul 12 '13 at 14:46
  • I do feel that unless the training and test samples both exceed 10,000 observations, it is often misleading to report accuracy on a holdout sample as opposed to doing intensive rigorous resampling. – Frank Harrell Jul 12 '13 at 16:51
  • Thanks Frank. I am having a hard time convincing some people in my academic community of that same observation that you just made. Do you know of particular papers/surveys that show that indeed reporting statistics on CV resampling (rather than a point estimate on a hold-out test) is an acceptable standard? – Amelio Vazquez-Reina Jul 12 '13 at 17:29
  • Simulations show it best. Look for example at Studies of Methods Used in the Text under http://biostat.mc.vanderbilt.edu/RmS – Frank Harrell Jul 12 '13 at 18:49
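For readers who want to check the n = 96 figure mentioned in the comments, a quick back-of-the-envelope sketch using the standard normal-approximation sample-size formula (the formula is my assumption; it is not stated explicitly in the thread):

```python
# Worst-case sample size for a +/- 0.1 margin of error on a proportion,
# using the normal approximation n = z^2 * p * (1 - p) / d^2.
z = 1.96          # two-sided 0.95 critical value
p = 0.5           # worst case: the value that maximizes p * (1 - p)
d = 0.1           # desired half-width (margin of error)

n = z**2 * p * (1 - p) / d**2
print(round(n))   # about 96 observations, matching the figure in the comment
```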