This question concerns the interpretation of output from the caret package.

  • Subsampling (either up or down) is set in caret's trainControl(). In this example sampling is set to "up", although the issue is the same when it is set to "down".
trControl <- trainControl(method  = "repeatedcv",
                          verboseIter = TRUE,
                          number  = 10,
                          repeats = 50,
                          savePredictions = "final",
                          classProbs = TRUE,
                          sampling = "up")
  • The model is trained without problems.
fit.LDA <- caret::train(frmla, 
                         Data.Frame, 
                         method = "lda",
                         preProc = c("center", "scale"), 
                         trControl = trControl)
  • However, when checking the confusion matrix using
confusionMatrix(fit.LDA$pred$pred, fit.LDA$pred$obs)

the no information rate remains the same as in the original dataset:
No Information Rate : 0.5691
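
For context, a minimal sketch of where this number comes from, assuming (as I understand confusionMatrix()) that the no information rate is the largest class proportion among the reference labels passed as the second argument. The labels below are hypothetical and chosen only so that the majority share is close to the value above; they are not the actual data.

```r
# Hypothetical reference labels (not the real data): 70 vs 53 observations
obs <- factor(rep(c("Class1", "Class2"), times = c(70, 53)))

# Largest class proportion among the reference labels
nir <- max(prop.table(table(obs)))
round(nir, 4)  # 0.5691

# With the fitted model from this question, the analogous check would be:
# max(prop.table(table(fit.LDA$pred$obs)))
```

If that assumption holds, the reported rate depends only on the class mix of the observed labels in the held-out predictions, not on the class mix of the data the model was fitted on.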

  • Question: shouldn't the no information rate be equal to 0.5 since both classes have been balanced?

When upsampling is performed beforehand with groupdata2::upsample(Data.Frame, "Class") and the model is then trained in caret, the no information rate is 0.5 (caret::upSample() can also be used for this purpose).

This is not a trivial question, since Kuhn (https://topepo.github.io/caret/subsampling-for-class-imbalances.html) shows that performing subsampling during resampling yields model performance metrics closer to those from the test set than performing subsampling outside of resampling, as with groupdata2.
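
The two orders of operation can be contrasted with a small base-R sketch. It assumes that sampling inside resampling duplicates rows only in the training part of each fold and leaves the held-out rows untouched, which is my reading of the caret documentation linked above. The labels, the fold assignment and the helper upsample_idx() are hypothetical and only illustrate the idea; they are not caret internals.

```r
set.seed(1)

# Hypothetical class labels (not the real data)
y <- factor(rep(c("Class1", "Class2"), times = c(70, 53)))
n <- length(y)

# Hypothetical helper: return row indices in which every class is
# replicated (with replacement) up to the size of the largest class
upsample_idx <- function(y, idx) {
  by_class <- split(idx, y[idx], drop = TRUE)
  n_max <- max(lengths(by_class))
  unlist(lapply(by_class, function(i) {
    c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
  }), use.names = FALSE)
}

# (a) Up-sampling during resampling: only the training part of each fold
#     is balanced; the held-out labels keep the original class mix
fold <- sample(rep(1:10, length.out = n))
obs_inside <- unlist(lapply(1:10, function(k) {
  holdout   <- which(fold == k)
  train_idx <- upsample_idx(y, setdiff(seq_len(n), holdout))  # balanced, used for fitting only
  as.character(y[holdout])
}))
nir_inside <- max(prop.table(table(obs_inside)))

# (b) Up-sampling before resampling: the whole data set is balanced first,
#     so any held-out labels drawn from it are balanced as well
y_up <- y[upsample_idx(y, seq_len(n))]
nir_outside <- max(prop.table(table(y_up)))

c(inside = nir_inside, outside = nir_outside)
```

In (b) copies of the same row can end up in both the training and the held-out part of a fold, which may be one reason the resampled estimates look more optimistic than test-set results.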

  • "Imbalanced" data are not a problem if you use appropriate quality measures. [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Nov 22 '21 at 11:07
  • Subsampling for class imbalance represents a deep misunderstanding of statistics. See fharrell.com/post/classification and fharrell.com/post/mlconfusion. – Frank Harrell Nov 22 '21 at 13:46
