Let's say I have a dataset with examples belonging to either class A or class B, but with more examples of class B than class A.

If I train a classifier on this data, will it be useless?

WindBreeze

[Some argue that class imbalance is no issue at all and that "correcting" for it is a poor strategy.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) I tend to find myself in this camp. – Dave Oct 25 '21 at 21:11

1 Answer

Training a classifier on data with a large class imbalance can lead to problematic behaviour. For example, applying the popular classification rule $\hat{p} < 0.5 \implies 0$, else $1$, to a logistic regression fit on imbalanced data can result in every observation being assigned to the majority class, since the estimated probabilities may all fall on one side of the cutoff.
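
To see this behaviour, here is a minimal sketch using scikit-learn on synthetic data (the 95/5 imbalance and the `class_sep` value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class problem with roughly a 95/5 class imbalance
# and overlapping classes (class_sep is lowered to force overlap)
X, y = make_classification(
    n_samples=10_000,
    n_features=5,
    weights=[0.95, 0.05],
    class_sep=0.5,
    random_state=0,
)

model = LogisticRegression().fit(X, y)

# .predict() applies the default 0.5 cutoff; with heavy imbalance,
# nearly every estimated probability sits below 0.5, so almost all
# observations land in the majority class
preds = model.predict(X)
print(np.bincount(preds))  # counts per predicted class
```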

In this case, it might be better to estimate the probability that each observation belongs to each class. This gives you the added flexibility of choosing your own cutoff should you want one, or, better yet, using the probabilities directly in any downstream decisions.
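
Continuing the snippet above, a sketch of working with the estimated probabilities instead (the 0.05 cutoff here is a hypothetical choice set at the minority-class prevalence, not a recommendation):

```python
# Estimated probability that each observation belongs to class 1
proba = model.predict_proba(X)[:, 1]

# Apply a cutoff suited to the problem rather than the default 0.5,
# e.g. flag anything above the minority-class prevalence
custom_cutoff = 0.05
flagged = proba >= custom_cutoff
print(flagged.sum(), "observations flagged as class 1")

# Or skip the cutoff entirely and carry `proba` into the downstream
# decision, e.g. an expected cost: proba * cost_of_false_negative
```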

Demetri Pananos