
I have a highly imbalanced dataset (imbalance ratio 1:100) and I trained a Balanced Random Forest on it. Does it make sense to report PPV and NPV when the majority class has been undersampled (as is done in a balanced random forest)?

I understand that PPV and NPV depend on the prevalence in the true population, but I am not sure whether it makes sense to report them when the model was trained on undersampled data.
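To make the prevalence dependence concrete, here is a minimal sketch using Bayes' rule. The sensitivity and specificity of 0.9 are hypothetical values chosen for illustration, not results from my model, and the function name `ppv_npv` is my own. It compares a balanced 1:1 sample with the 1:100 ratio in my data.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a classifier at a given prevalence, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical operating point, assumed purely for illustration.
SENSITIVITY = 0.9
SPECIFICITY = 0.9

# Prevalence in a balanced (undersampled) set versus the original 1:100 ratio.
balanced = ppv_npv(SENSITIVITY, SPECIFICITY, prevalence=0.5)
original = ppv_npv(SENSITIVITY, SPECIFICITY, prevalence=1 / 101)

print("balanced 1:1   PPV=%.4f NPV=%.4f" % balanced)
print("original 1:100 PPV=%.4f NPV=%.4f" % original)
```

With the same sensitivity and specificity, the PPV computed at the balanced prevalence is far higher than at the 1:100 prevalence, which is why I am unsure which set the reported values should refer to.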

  • 1
    Class imbalance almost certainly is not a problem, and there is no need to use undersampling or oversampling to solve a non-problem. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Oct 26 '21 at 09:30
  • 2
    @astel: true, nobody asked whether unbalanced data are a problem, or whether undersampling would address this. The OP *presumed* it. I would say it is worthwhile pointing out this incorrect underlying assumption, even if, yes, some of us do so almost every day - because these assumptions are apparently ineradicable. If you have a problem with that, you can always open a question here. Until you do so, I would appreciate it if you refrained from snide comments. Thank you. – Stephan Kolassa Oct 26 '21 at 10:04

0 Answers