Suppose I want to estimate the out-of-sample prediction error of a boosted regression model that has random intercepts and slopes. There are $G$ groups and $N$ observations. If I want to estimate the out-of-sample prediction error using $k$-fold cross-validation, how do I set up the data partitioning? Is it more complicated than ordinary $k$-fold cross-validation? Note: my use case here is predicting data from a new group.
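For concreteness, a minimal sketch of group-level partitioning, using scikit-learn's GroupKFold so that no group is ever split across folds. The plain GradientBoostingRegressor and the simulated data are stand-ins for the actual mixed-effects boosted model, which scikit-learn does not provide:

```python
# Sketch: group-level k-fold CV, so every test observation comes from a
# group the model never saw during training (the "new group" setting).
# The estimator and simulated data below are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
G, N = 20, 400                             # number of groups, observations
groups = np.repeat(np.arange(G), N // G)   # group label per observation
X = rng.normal(size=(N, 5))
y = X[:, 0] + rng.normal(size=N)           # toy response

cv = GroupKFold(n_splits=5)                # folds never split a group
fold_mse = []
for train_idx, test_idx in cv.split(X, y, groups):
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], pred))
print(f"estimated new-group MSE: {np.mean(fold_mse):.3f}")
```

Because the held-out groups contribute no training data, the averaged error targets prediction for a new group rather than for a new observation from an already-seen group.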

kjetil b halvorsen
Brash Equilibrium
  • I think I have seen people doing some sort of CV within CV, where they first take k of the G groups out, and with the remaining data take k2 of the N observations out from each of the k groups (see the sketch after this comment thread). But in my projects I simply do CV at the group level, that's it. So if G is not very large, I do a G-fold CV; otherwise, a k-fold CV with k < G. – qoheleth Sep 12 '14 at 04:57
  • Was totally going to ask if nested folds were the way to go. I mean, that's what it seems like to me, but only if you are guaranteed to have enough to make k_indiv folds for each individual. – Brash Equilibrium Sep 12 '14 at 08:39
  • I think k-folds derived from the group level alone are problematic, because you are not cross-validating the within-individual predictive accuracy. – Brash Equilibrium Sep 12 '14 at 08:40
  • Yeah, but I guess we have a bias-variance trade-off here. – qoheleth Sep 16 '14 at 04:07
  • Not if we can come up with the proper partitioning scheme that reflects the sampling model. – Brash Equilibrium Sep 16 '14 at 21:24
  • This paper (Roberts et al. 2017), which discusses CV strategies for data with dependence structures (including group-based dependence), is worth reading: https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881 Good overview + discussion. – adibender Jun 02 '19 at 15:38
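A sketch of the nested scheme qoheleth describes, under the same toy setup as above: the outer loop holds out whole groups (new-group error) and the inner loop holds out observations from each retained group (within-group error). The fold counts are arbitrary placeholders; the split utilities are real scikit-learn classes:

```python
# Sketch of CV-within-CV: outer folds drop whole groups, inner folds drop
# observations from each remaining group (stratifying on the group label
# so every retained group contributes to every inner fold).
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(1)
G, N = 20, 400
groups = np.repeat(np.arange(G), N // G)   # N // G observations per group
X = rng.normal(size=(N, 5))
y = X[:, 0] + rng.normal(size=N)           # toy response

outer = GroupKFold(n_splits=5)             # k of the G groups held out
for outer_train, outer_test in outer.split(X, y, groups):
    # stratify the inner split on the group label: each inner fold takes
    # roughly 1/4 of the observations from every retained group
    inner = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for inner_train, inner_val in inner.split(X[outer_train],
                                              groups[outer_train]):
        pass  # fit here; check within-group accuracy on inner_val
    # refit on all of outer_train; score new-group error on outer_test
```

This makes the trade-off in the comments concrete: the outer loop alone estimates new-group error, while the inner loop is only needed if you also want to validate within-group predictions.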

0 Answers