
I want to estimate the residuals (actual - predicted) within a k-fold cross-validation scheme (i.e., predicted residuals) in a regression problem.

The aim is to get a reasonable estimate, as the data is high dimensional with p (~ 1000) >> n (< 100). If I fit a model (I am using LASSO) on the whole data set, it over-fits and the residuals are close to 0. So I thought k-fold CV could help, and it does.
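A minimal sketch of what I mean, assuming scikit-learn (the simulated X and y are placeholders just to fix the dimensions):

```python
# Sketch only: predicted residuals from k-fold CV with LASSO (scikit-learn).
# The simulated X, y stand in for my real n < 100, p ~ 1000 data.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 1000))                    # n = 80, p = 1000
y = X[:, :5] @ rng.standard_normal(5) + rng.standard_normal(80)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat = cross_val_predict(LassoCV(cv=5), X, y, cv=cv)  # each case predicted exactly once
residuals = y - y_hat                                  # one predicted residual per case
```

This works, but it comes with many questions: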

  • What should I use as the value of k? Leave-one-out is over-optimistic, so the residuals will be close to 0. Should I use the standard 10-fold, or perhaps 5-fold? Fewer folds (e.g. 2) will produce poor models due to the small training samples and inflate the residuals. It is hard to determine a sensible choice of k.
  • Should I repeat the CV procedure, and if so, how should I combine the residuals from the repeats? If I simply average across repeats, the residuals will again be close to 0, since that is something like bagging and approaches LOO in the limit.

Any suggestions are welcome. Thanks!

Krrr
  • just to clarify, by residuals you are referring to MSE/MAE, right? Normally residuals refer to observed - predicted – StupidWolf Apr 16 '20 at 10:10
  • @StupidWolf I am referring to observed - predicted, edited the question to make it clear! – Krrr Apr 16 '20 at 10:14
  • thank you for clarifying... that's quite a bit of data to collect. If you do cross-validation, you get 1 residual for each data point (regardless of k), because each case is predicted only once. You need to repeat the CV procedure – StupidWolf Apr 16 '20 at 10:17
  • @StupidWolf thanks! Yes I am aware of that but any choice of k and number of repeats is problematic, so the question is which choices are sensible? – Krrr Apr 16 '20 at 16:22
  • Not very sure I understand the problem, but naively, you could do one round of k-fold validation for several values of k to see whether (and where) over-fitting starts, i.e. just compare the MSE across the different k-fold validations. Not sure about your data, but hopefully you find something sensible this way – StupidWolf Apr 16 '20 at 16:28
  • Now as for the number of repeats, this is a bit more straightforward: how accurate do you want the error estimate to be? The more repeats you do, the more accurate you are... you just have to run as many as your use case needs – StupidWolf Apr 16 '20 at 16:30
  • @StupidWolf: more repetitions do not necessarily make the error estimate more accurate (not even more precise): there are several sources of error that all contribute to the generalization error. Repetitions can help with the error source "model instability", but e.g. they won't do anything about the error source "limited number of independent cases tested". If the models are stable, repetitions won't make the estimate any better - but they can be used to show that the models are stable. – cbeleites unhappy with SX Apr 16 '20 at 17:19

1 Answer


For the choice of $k$, see Choice of K in K-fold cross-validation.


Repeating:

  • yes, I definitely recommend at least a few repetitions, since this allows you to determine stability (e.g. as the standard deviation of the predictions for the same test case across different repetitions).

    If the variation there is negligible, your models are stable, and further repetitions would not change anything.

    If the variation is substantial, the [surrogate] models are unstable. More repetitions then give a better estimate of the average error of the surrogate models, but that does not necessarily help with estimating the generalization error of the model trained on the whole data set, since that model is likely subject to almost the same instability.

  • Calculating a pooled error across the residuals from all folds and repetitions is not the same as an out-of-bag error: out-of-bag error would pool/average the predictions for the same case and then calculate one residual (average prediction - reference) per case. A sketch of both the stability check and this distinction is given below.
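A rough sketch of both points, assuming scikit-learn and reusing the X and y from the question (the number of repetitions is arbitrary):

```python
# Sketch: repeated k-fold CV to (a) check model stability and
# (b) contrast pooled residuals with an out-of-bag-style residual per case.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

n_repeats = 20  # arbitrary; enough to see whether the predictions vary
preds = np.column_stack([
    cross_val_predict(LassoCV(cv=5), X, y,
                      cv=KFold(n_splits=10, shuffle=True, random_state=r))
    for r in range(n_repeats)
])                                    # shape (n_cases, n_repeats)

per_case_sd = preds.std(axis=1)       # stability: spread of predictions per test case
pooled_residuals = (y[:, None] - preds).ravel()  # pooled over folds and repetitions
oob_style = y - preds.mean(axis=1)    # average the predictions first, then one residual per case
```

If per_case_sd is negligible everywhere, the models are stable and further repetitions change little; in that case the pooled and out-of-bag-style residuals will also nearly coincide.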

cbeleites unhappy with SX
  • Thanks for your answer. Perhaps my question is not clear. The selection of k you describe is good for estimating the generalization error, but would that be the same for estimating the expected residuals of the data at hand? – Krrr Apr 16 '20 at 19:36