
I have been looking into the imbalanced learning problem, where a classifier is often unduly biased in favour of the majority class. However, I am having difficulty identifying datasets where class imbalance is genuinely a problem and, where it is a problem, where it can actually be fixed by re-sampling or re-weighting the data.

Can anyone give reproducible examples of real-world (not synthetic) datasets where re-sampling or re-weighting can be used to improve the accuracy (or, equivalently, the misclassification error rate) of some particular classifier system, when applied in accordance with best practice? A sketch of the kind of comparison I have in mind is given at the end of the question.

I am only interested in accuracy as the performance measure. There are some tasks where accuracy is the quantity of interest in the application (see e.g. my answer to a related question), so I would appreciate it if there were no digressions onto the topic of proper scoring rules, or other performance measures.

It is not an example of the class imbalance problem if the operational class frequencies are different to those in the training set or the misclassification costs are not equal. Cost-sensitive learning is a different issue.
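
For concreteness, here is a minimal sketch of the kind of comparison I am asking for. It uses scikit-learn with a logistic regression classifier and `class_weight="balanced"` as the re-weighting scheme; both choices are mine and purely illustrative, and the breast-cancer dataset used here is only mildly imbalanced, so it merely stands in for whatever genuinely imbalanced real-world dataset an answer would supply.

```python
# Sketch of the comparison being asked for: does re-weighting improve
# cross-validated accuracy for a given classifier on a real dataset?
# The breast-cancer data (212 vs 357 cases) is only mildly imbalanced
# and is a placeholder for a genuinely imbalanced dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Baseline: plain classifier, evaluated by cross-validated accuracy.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
acc_base = cross_val_score(baseline, X, y, cv=10, scoring="accuracy").mean()

# Re-weighted: the same classifier with inverse-frequency class weights,
# i.e. the usual re-weighting recipe for imbalanced learning.
reweighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
acc_rw = cross_val_score(reweighted, X, y, cv=10, scoring="accuracy").mean()

print(f"baseline accuracy:    {acc_base:.4f}")
print(f"re-weighted accuracy: {acc_rw:.4f}")
```

The same protocol applies to re-sampling: substitute a re-sampling step (e.g. SMOTE) for the class weights and again compare cross-validated accuracy against the unmodified baseline.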

Dikran Marsupial
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/132903/discussion-on-question-by-dikran-marsupial-are-there-imbalanced-learning-problem). – Sycorax Jan 05 '22 at 14:55
  • We should never use AUC as an objective function but rather use full information continuous measures such as deviance. And avoid classification at all costs, by using probability models, unless you are in an ultra-high signal:noise ratio situation such as playing games or simple pattern recognition in ML. – Frank Harrell Jan 05 '22 at 14:55
  • @Sycorax you moved far too much of it to chat. – Frank Harrell Jan 05 '22 at 14:57

0 Answers