
I thought $K$-fold cross-validation consists of the following steps.

  1. Split data randomly into $K$ chunks.
  2. Fit on $K-1$ chunks.
  3. Predict on remaining chunk. Keep predictions.
  4. Repeat 2-3 for the remaining $K-1$ ways of leaving out one of the $K$ chunks.
  5. Evaluate a loss statistic that compares all pooled predictions to the true values (sketched below).
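
To make the comparison concrete, here is a minimal base-R sketch of this first procedure; `lm()` and squared-error loss are just stand-ins for whatever model and loss are actually being validated:

```r
# Procedure 1: pool all out-of-fold predictions, then compute ONE loss.
set.seed(1)
n   <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

K    <- 10
fold <- sample(rep(1:K, length.out = n))   # step 1: random split into K chunks
pred <- numeric(n)

for (k in 1:K) {
  fit <- lm(y ~ x, data = dat[fold != k, ])                    # step 2: fit on K-1 chunks
  pred[fold == k] <- predict(fit, newdata = dat[fold == k, ])  # step 3: keep held-out predictions
}

cv_mse <- mean((dat$y - pred)^2)   # step 5: one loss over all pooled predictions
cv_mse
```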

Now I have seen the following procedure (implemented by `xbart` in the dbarts package):

  1. Split data randomly into $K$ chunks.
  2. Fit on $K-1$ chunks.
  3. Predict on remaining chunk. Evaluate loss statistic and keep.
  4. Repeat 1-3 $N$ times.
  5. Average the $N$ loss statistics, or pool them in some other way (sketched below).
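
And the corresponding sketch of the second procedure as I understand it (same stand-ins as above; note that each repetition draws a fresh random split and only one chunk per repetition is ever used for testing):

```r
# Procedure 2 (Monte Carlo / repeated random sub-sampling validation):
# each repetition holds out one freshly drawn chunk and records its loss.
set.seed(1)
n   <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

K <- 10    # each held-out chunk has about n/K observations
N <- 50    # number of random repetitions

losses <- numeric(N)
for (i in 1:N) {
  test      <- sample(n, size = n %/% K)        # steps 1+3: fresh random chunk
  fit       <- lm(y ~ x, data = dat[-test, ])   # step 2: fit on the rest
  pred      <- predict(fit, newdata = dat[test, ])
  losses[i] <- mean((dat$y[test] - pred)^2)     # step 3: loss on this chunk only
}

mc_mse <- mean(losses)   # step 5: average the N loss statistics
mc_mse
```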

Note the difference in steps 4 and 5.

The first procedure is standard and recommended in major textbooks. The second procedure seems new to me. I cannot immediately see why not to do it, but it seems suboptimal in terms of variance. Are there arguments for or against the second procedure?

The second approach is implemented in the package cited above, and I wonder whether it is wrong to do.

gung - Reinstate Monica
tomka
  • If your loss is defined per observation (e.g. squared error for a particular observation), then the average loss will be the same either way. I wonder, then, in which situations loss is not defined per observation but rather is a function of a whole chunk of observations at once. Perhaps median error per chunk? Then one would need to think about how that differs between the two scenarios. – Richard Hardy Feb 26 '18 at 18:56
  • @RichardHardy Not sure. Perhaps ratio-based estimates such as the area under the ROC curve (AUC)? Median error for sure. – tomka Feb 26 '18 at 19:03
  • @RichardHardy Even in the simple case, procedure 1 seems to be the more efficient one, as it seems $N$ needs to be large (as in the bootstrap) to control the variance of the loss estimate, thus requiring far more model fits. Or am I going wrong somewhere? – tomka Feb 26 '18 at 19:16
  • Think of the simplest case: squared loss as the loss function, 2 folds and 2 observations per fold: $(x_{1,1},x_{1,2})$, $(x_{2,1},x_{2,2})$ and the corresponding forecast errors $e_{i,j}$. It does not matter which procedure I use, as in both cases I get $\text{MSE}=\frac{1}{2}(\frac{1}{2}(e_{1,1}^2+e_{1,2}^2)+\frac{1}{2}(e_{2,1}^2+e_{2,2}^2))=\frac{1}{4}(e_{1,1}^2+e_{1,2}^2+e_{2,1}^2+e_{2,2}^2)$. – Richard Hardy Feb 26 '18 at 19:44
  • @tomka *1)* Do I understand correctly that the difference is in points 4 and 5? *2)* Which loss statistics are allowed in `xbart`? The method is certainly incorrect for RMSE which is subadditive. – Jim Feb 26 '18 at 21:19
  • What `xbart` does seems completely reasonable. Excerpt from Kuhn & Johnson, "Applied Predictive Modeling" (Sect. 4.4), which agrees with this "average of $k$" procedure for a threefold cross-validation: "*Performance estimates, such as the error rate or $R^2$ are calculated from each set of held-out samples. The average of the three performance estimates would be the cross-validation estimate of model performance.*" Or from Wasserman's "All of Statistics" (Sect. 13.6): "*This process is repeated for each of the $k$ groups and the resulting risk estimates are averaged.*" – usεr11852 Feb 26 '18 at 23:33
  • @Jim 1) Yes, indeed. 2) Misclassification rate for factors and squared loss for continuous variables. As Richard Hardy has pointed out, the procedures are equivalent in these cases. Yet fitting $N$ models will take longer than fitting $K$ models, as usually $K < N$. – tomka Feb 27 '18 at 08:41
  • @usεr11852 Sorry, but I believe there is a misunderstanding. The quotes all refer to procedure 1, not procedure 2. – tomka Feb 27 '18 at 08:42
  • @RichardHardy Yes, in this case it's equivalent. But consider the sheer number of models that have to be fit in procedure 2 compared to procedure 1. Meanwhile the package author has replied and calls his procedure 2 "Monte Carlo cross-validation". – tomka Feb 27 '18 at 08:44
  • @tomka, looking at your description of the two procedures, it seems the number of models being fit is the same. The only difference is at which stage one calculates the loss statistic. – Richard Hardy Feb 27 '18 at 09:01
  • @RichardHardy Procedure 1 calculates the loss once on each case for sure. Procedure 2 seems to do so only if $N$ is large. For, say, $K=N=10$, the second procedure quite likely does not use all cases for testing. I believe this difference must affect the variance properties of the loss estimates. But this is basically my question. – tomka Feb 27 '18 at 09:11
  • @tomka, ohhh! I missed the difference in 4. between the procedures... – Richard Hardy Feb 27 '18 at 09:41
  • @tomka: I made the same mistake as Richard. Sorry. – usεr11852 Feb 27 '18 at 22:37

1 Answer


Short answer: it is neither wrong nor new.


We discussed this validation scheme under the name "set validation" about 15 years ago when preparing a paper*, but in the end we never actually used the term, as we didn't find it used in practice.

Wikipedia refers to the same validation scheme as repeated random sub-sampling validation or Monte Carlo cross-validation.

From a theory point of view, the concept was of interest to us because

  • it is another interpretation of the same numbers usually referred to as hold-out (only the model the estimate is attributed to differs: hold-out estimates are taken as the performance estimate for exactly the model that was tested, whereas this set or Monte Carlo validation treats the tested model(s) as surrogate model(s) and interprets the very same numbers as a performance estimate for a model built on the whole data set - as is usually done with cross-validation or out-of-bootstrap validation estimates)
  • and it is somewhere in between
    • the more common cross-validation techniques (resampling without replacement, interpretation as an estimate for the whole-data model),
    • hold-out (see above: same calculation and the same numbers, but typically without the N iterations/repetitions, and with a different interpretation),
    • and out-of-bootstrap (the N iterations/repetitions are typical for out-of-bootstrap, but I've never seen them applied to hold-out, and they are [unfortunately] rarely done with cross-validation).

* Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G. Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91 - 100 (2005).
The "set validation" error for N = 1 is hidden in fig. 6 (i.e. its bias + variance can be recostructed from the data given but are not explicitly given.)


but it seems not optimal in terms of variance. Are there arguments in favor or against the second procedure?

Well, in the paper above we found the total error (bias² + variance) of out-of-bootstrap and repeated/iterated $k$-fold cross-validation to be pretty similar (with oob having somewhat lower variance but higher bias - we did not follow up to check whether/how much of this trade-off is due to resampling with/without replacement and how much is due to the different split ratio of about 1 : 2 for oob).
Keep in mind, though, that I'm talking about accuracy in small-sample-size situations, where the dominating contribution to variance uncertainty is the same for all resampling schemes: the limited number of true samples available for testing. That is the same for oob, cross validation, or set validation. Iterations/repetitions allow you to reduce the variance caused by instability of the (surrogate) models, but not the uncertainty due to the limited total sample size.
Thus, assuming that you perform an adequately large number of iterations/repetitions N, I'd not expect practically relevant differences in the performance of these validation schemes.
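
If you want a quick numerical sanity check of that claim, here is a toy base-R simulation (with `lm()`, squared error and a simulated data-generating model as purely illustrative assumptions, nothing specific to `xbart` or our paper) comparing the spread of the two estimators over repeated small data sets:

```r
# Toy comparison: K-fold CV (pooled predictions) vs. Monte Carlo / set
# validation (averaged per-chunk losses), both estimating the same error.
simulate_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 * x + rnorm(n))
}

kfold_cv <- function(dat, K = 10) {
  n    <- nrow(dat)
  fold <- sample(rep(1:K, length.out = n))
  pred <- numeric(n)
  for (k in 1:K) {
    fit <- lm(y ~ x, data = dat[fold != k, ])
    pred[fold == k] <- predict(fit, newdata = dat[fold == k, ])
  }
  mean((dat$y - pred)^2)
}

mc_cv <- function(dat, K = 10, N = 50) {
  n <- nrow(dat)
  losses <- replicate(N, {
    test <- sample(n, size = n %/% K)
    fit  <- lm(y ~ x, data = dat[-test, ])
    mean((dat$y[test] - predict(fit, newdata = dat[test, ]))^2)
  })
  mean(losses)
}

set.seed(2)
res <- t(replicate(200, {
  dat <- simulate_data(60)
  c(kfold = kfold_cv(dat), mc = mc_cv(dat))
}))
colMeans(res)       # both estimate roughly the same error ...
apply(res, 2, sd)   # ... with similar spread, provided N is not too small
```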

One validation scheme may fit better with the scenario you try to simulate by the resampling, though.

cbeleites unhappy with SX