I have the trainControl function as follows:

fitControl <- trainControl(method = "repeatedcv",       # repeated k-fold CV
                           number = 5,                   # k = 5 folds
                           repeats = 5,                  # 5 repetitions
                           savePredictions = TRUE,       # keep hold-out predictions
                           classProbs = TRUE,            # needed for ROC-based summaries
                           summaryFunction = twoClassSummary,  # reports ROC, Sens, Spec
                           search = "random")            # random hyperparameter search

Considering that I have a dataset of 150 samples with approximately 40 features each, which values of number and repeats should I choose to make the training more strict?

Also, considering that my samples show high variation (the tested groups have patterns, but the measurements vary a lot from sample to sample), is stricter training better or worse?

1 Answer

Folds

The number of folds is IMHO not a critical decision for repeated k-fold CV in itself, as long as it isn't so large as to hamper repetitions (consider the extreme case of k = n, i.e. LOO: the repetitions would always be exactly the same).
See e.g. Choice of K in K-fold cross-validation

The uncertainty on the final pooled performance estimate will be practically the same whether you test 5 surrogate models with 30 test cases each or 10 surrogate models with 15 test cases each.
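
As a minimal sketch (assuming a two-class training set of 150 samples, as in the question), both of the following control objects test every sample exactly once per repetition; only the number of folds the 150 test cases are split into differs:

library(caret)
fitControl5  <- trainControl(method = "repeatedcv", number = 5,  repeats = 5,
                             classProbs = TRUE, summaryFunction = twoClassSummary)
fitControl10 <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             classProbs = TRUE, summaryFunction = twoClassSummary)
# 5 folds of 30 test cases vs. 10 folds of 15 test cases: either way,
# all 150 cases are tested once per repetition, so the pooled estimate
# carries practically the same uncertainty.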

However, if you are going to use the cross validation results to steer model selection, you need to check whether the heuristic you apply for selecting the best model depends on the number of folds (e.g. by looking at the standard deviation between folds without correcting for the number of test cases per fold).
In that case, the number of folds and the settings for the heuristic need to be adapted to each other.
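
A hypothetical check of this (fit5 and fit10 are assumed to be train() results obtained with 5- and 10-fold control objects): the raw between-fold standard deviation grows as folds get smaller, because each per-fold estimate rests on fewer test cases.

sd_5  <- sd(fit5$resample$ROC)    # SD across folds of 30 test cases each
sd_10 <- sd(fit10$resample$ROC)   # SD across folds of 15 test cases each
# A rule like "pick the simplest model within one SD of the best" reacts
# to this difference, so fold count and heuristic must be tuned together
# unless the SD is corrected for the number of test cases per fold.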

Repetitions

Repetitions help with model instability, i.e. with variation between the surrogate models trained on different splits of the same data.

  • if you "only" need to show that your training results in stable models, a few (3 - 5) repetitions are sufficient.

  • if you need to reduce the uncertainty of the performance estimate because your training results are unstable, you need a larger number of repetitions.

    One heuristic is to use enough repetitions that the uncertainty due to model instability is << the uncertainty due to the total number of tested cases (a sketch follows this list).

    Another is to make sure the contribution of instability to the total uncertainty of the performance estimate is << the acceptable uncertainty of the performance estimate.
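
A sketch of the first heuristic, using accuracy for simplicity since the binomial variance formula applies to proportions (rep_means is an assumed vector holding one pooled accuracy per repetition; 150 is the total number of tested cases per repetition):

p_hat    <- mean(rep_means)                     # pooled performance estimate
var_inst <- var(rep_means) / length(rep_means)  # instability contribution to the mean
var_test <- p_hat * (1 - p_hat) / 150           # binomial variance from the finite test set
# add repetitions until var_inst << var_test
# (or, for the second heuristic, until var_inst << your acceptable variance)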

cbeleites unhappy with SX