4

I am working on a text classification problem. My data is highly imbalanced: for example, one category has 700 documents while another has only 30. I have around 30 categories. I have tried different classifiers and the performance is consistently poor.

What is the best way to tackle this issue? Thanks

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
y2p
  • 217
  • 3
  • 5
  • 1
    See the related questions on the right hand side of the screen. Here is a duplicate question, [For a classification problem if class variable has unequal distribution which technique we should use?](http://stats.stackexchange.com/q/2131/1036) – Andy W Sep 26 '11 at 12:09
  • 1
    When you say "performance is consistently poor" it would be helpful if you explained the poor performance more precisely. – Karl Sep 26 '11 at 23:36
  • [many similar posts](https://stats.stackexchange.com/search?q=classi*+unbalanc*+answers%3A1) – kjetil b halvorsen Dec 20 '19 at 03:18

4 Answers

3

As @madness said, turning the classification problem into a probability estimation problem is a common solution. To estimate the probabilities properly, we have to choose a proper loss function. Logistic loss and MSE are two common ones; both are proper scoring rules (dive into Savage's old paper "The elicitation of personal probabilities and expectations" for details).

From a practical viewpoint, I would ask whether your test set (or whatever dataset/scenario you want to apply your model to) has the same positive (or negative) sample ratio (700/730) as your training set. If that is not the case, resampling the training data to match the test set could be a better solution.
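
As a minimal sketch of that check (my own illustration, not part of the original answer), assuming NumPy and placeholder label arrays in place of the real data:

```python
import numpy as np

# Placeholder labels: y_train mimics the 700-vs-30 training data, y_deploy the
# data the model will actually be applied to (if such a sample is available).
rng = np.random.default_rng(0)
y_train = np.array([1] * 700 + [0] * 30)
y_deploy = np.array([1] * 400 + [0] * 100)

train_share = y_train.mean()    # share of the majority (positive) class in training
deploy_share = y_deploy.mean()  # share expected at prediction time
print(f"train: {train_share:.2f}, deployment: {deploy_share:.2f}")

# If the shares differ, downsample the over-represented class so the training
# ratio matches the deployment ratio (one simple form of resampling).
pos, neg = np.flatnonzero(y_train == 1), np.flatnonzero(y_train == 0)
n_pos = int(deploy_share / (1 - deploy_share) * len(neg))
keep = np.concatenate([rng.choice(pos, size=min(n_pos, len(pos)), replace=False), neg])
y_train_matched = y_train[keep]
```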

2

The question is which loss function you have. Most classifiers are built to minimize the 0-1 loss, that is, they assume that the loss of classifying an A as a B is the same as that of classifying a B as an A. If this really is your loss function, you should be happy classifying every sample as belonging to the majority group; in your case this silly classifier gets the answer right 700/730 of the time.

So, from a practical point of view, an easy way to go is to change your loss function. This can be implemented with plug-in classifiers such as logistic regression, where you estimate the probability $P(Y=1|x)$. The usual rule is to compare the estimated probability $\widehat{P}(Y=1|x)$ with $\frac{1}{2}$, which is motivated by the 0-1 loss. Different losses produce different cutoffs, so what you can do is change this cutoff. Setting it to the prior itself (30/730) usually gives reasonable results. I suggest using a ROC curve to choose the cutoff.
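
A rough sketch of the cutoff idea (my own illustration, assuming scikit-learn and simulated data in place of the actual documents): fit a logistic regression, then compare the estimated probabilities against the prior instead of 1/2, or pick the cutoff from the ROC curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Simulated stand-in for the 700-vs-30 data; replace with your document features.
X, y = make_classification(n_samples=730, weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_hat = clf.predict_proba(X_te)[:, 1]   # estimated P(Y=1 | x)

# Cutoff at the minority-class prior (roughly 30/730) instead of 1/2.
prior = y_tr.mean()
pred_prior = (p_hat >= prior).astype(int)

# Or choose the cutoff from the ROC curve, e.g. the point closest to the
# top-left corner (perfect classification).
fpr, tpr, thresholds = roc_curve(y_te, p_hat)
best = np.argmin((1 - tpr) ** 2 + fpr ** 2)
pred_roc = (p_hat >= thresholds[best]).astype(int)
```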

madness
  • 117
  • 4
  • Why would you use a prior for a cut-off for probability? This seems completely arbitrary. It will result in 50% of your examples predicted as class 1, so if that's what you want, it's correct. But I can't imagine why it would be a "reasonable" approach. – max Jun 09 '16 at 17:34
0

To handle the class-imbalance problem, you can use any of the following:
1. Under/over sampling
2. Cost-sensitive learning
3. Boosting

Alternatively, you can design a loss function, e.g. the ramp loss, and use it within your model for learning from imbalanced data.
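
As a hedged illustration of the first two options (my sketch, not code from this answer), assuming scikit-learn and simulated data standing in for the 700-vs-30 documents:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Simulated stand-in for the imbalanced data; replace with your document features.
X, y = make_classification(n_samples=730, weights=[0.96, 0.04], random_state=0)

# 1. Random oversampling: repeat minority-class rows until the classes are balanced.
rng = np.random.default_rng(0)
minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
X_over, y_over = X[keep], y[keep]

# 2. Cost-sensitive learning: penalize mistakes on the rare class more heavily
#    via per-class weights inversely proportional to class frequency.
cost_sensitive = LinearSVC(class_weight="balanced").fit(X, y)
```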

CKM
  • 235
  • 2
  • 10
0

You may want to give random forests or bagging/boosting methods a try (I'm assuming you haven't tried these yet at this stage).

Usually these techniques work well for this kind of problem. How well will depend on the data: even if the classes are imbalanced, as long as they differ in their characteristics a classification method should have no problem distinguishing between them.

In some cases, however, the classifier itself might be the cause of the imbalance. You can then rescale the predictions based on the confusion matrix. The main idea is described here: http://www.creatapreneur.com/2012/07/.
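
A minimal sketch of these suggestions (my own, assuming scikit-learn and simulated data rather than the actual documents): fit a random forest and a boosted model, score them with an imbalance-aware metric, and inspect the confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

# Simulated stand-in for the imbalanced text features.
X, y = make_classification(n_samples=730, weights=[0.96, 0.04], random_state=0)

for model in (RandomForestClassifier(n_estimators=500, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    # Accuracy is misleading at 700-vs-30; AUC is more informative here.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    # The confusion matrix shows whether predictions are more imbalanced than the data.
    cm = confusion_matrix(y, cross_val_predict(model, X, y, cv=5))
    print(type(model).__name__, round(auc, 3), cm.ravel())
```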

Vincent Warmerdam
  • 1,129
  • 1
  • 9
  • 10