Let's say I have a dataset with examples belonging to either class A or class B, but with more examples of class B than class A.

If I train a classifier on this data, will it be useless?

WindBreeze

[Some argue that class imbalance is no issue at all and that "correcting" for it is a poor strategy.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) I tend to find myself in this camp. – Dave Oct 25 '21 at 21:11

1 Answer

Training a classifier on data with a large class imbalance can lead to problematic behaviour. For example, applying the popular classification rule $\hat{p} < 0.5 \implies 0$, else $1$, to a logistic regression fit on imbalanced data can result in every observation being assigned to the majority class, since the estimated probabilities may all fall on one side of the cutoff.
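
To see this behaviour, here is a minimal sketch using scikit-learn on synthetic data (the 95/5 imbalance and the `class_sep` value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class problem with roughly a 95/5 class imbalance
# and overlapping classes (class_sep is lowered to force overlap)
X, y = make_classification(
    n_samples=10_000,
    n_features=5,
    weights=[0.95, 0.05],
    class_sep=0.5,
    random_state=0,
)

model = LogisticRegression().fit(X, y)

# .predict() applies the default 0.5 cutoff; with heavy imbalance,
# nearly every estimated probability sits below 0.5, so almost all
# observations land in the majority class
preds = model.predict(X)
print(np.bincount(preds))  # counts per predicted class
```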

In this case, it might be better to estimate the probability that each observation belongs to each class. This gives you the added flexibility of choosing your own cutoff should you want one, or, better yet, using the probabilities directly in any downstream decisions.
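
Continuing the snippet above, a sketch of working with the estimated probabilities instead (the 0.05 cutoff here is a hypothetical choice set at the minority-class prevalence, not a recommendation):

```python
# Estimated probability that each observation belongs to class 1
proba = model.predict_proba(X)[:, 1]

# Apply a cutoff suited to the problem rather than the default 0.5,
# e.g. flag anything above the minority-class prevalence
custom_cutoff = 0.05
flagged = proba >= custom_cutoff
print(flagged.sum(), "observations flagged as class 1")

# Or skip the cutoff entirely and carry `proba` into the downstream
# decision, e.g. an expected cost: proba * cost_of_false_negative
```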

Demetri Pananos