How to deal with unbalanced data

Question

I'm doing data analysis on a dataset of 11795 data points with 88 features. About 85% (9973 points) of these belong to class 1, 5% (589 points) belong to class 2 and 10% (1233 points) belong to class 3.

I'm trying to build a model from this data for predicting the class of new data points. If I build my model using this dataset, will it favour the class 1 data points? Would it be difficult for the model to detect the low-frequency classes?

Generally, how does one tackle unbalanced data sets such as the one I have?

Thank you for any advice =)

P.S.

I'm using k-nearest neighbor and regularized linear regression methods.

jpmuc · Accepted Answer · 2014-06-25T11:15:38.700

How you deal with unbalanced data classes depends on the particular classifier you work with. What classifier are you using? For these cases, the one-vs-class strategy has been reported to perform better than a naive approach, since each classifier works with a more balanced data set.

There are also a couple of classifier-agnostic strategies, such as stratified sampling and other sampling methods.
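To make stratified sampling concrete, here is a minimal pure-Python sketch of a stratified train/test split. The function name `stratified_split` and the 20% test fraction are my own choices for illustration, and the labels are just built from the class counts in the question.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly its share in both parts."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Take the same fraction from every class, not from the pooled data.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels with the class counts given in the question (illustrative only).
labels = [1] * 9973 + [2] * 589 + [3] * 1233
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a plain random split the rare classes can end up over- or under-represented in the test part by chance; splitting per class avoids that.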

P.S. You said you are using kNN. One standard approach is to weight the neighbouring vectors according to their distance to the sample. There are a few approaches. I am only familiar with this one. The paper is quite nice.
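One common way to read the distance-weighting idea is inverse-distance voting: each of the k nearest neighbours votes with weight 1/distance, so a close minority neighbour can outvote several distant majority neighbours. The sketch below assumes that reading; it is not necessarily the exact scheme of the paper mentioned above, and `weighted_knn_predict` and the toy points are made up for illustration.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_X, train_y, query, k=5, eps=1e-9):
    """Predict a label by letting each of the k nearest neighbours vote with weight 1/distance."""
    # Sort all training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))

    votes = defaultdict(float)
    for d, y in dists[:k]:
        # eps avoids division by zero when the query coincides with a training point.
        votes[y] += 1.0 / (d + eps)
    return max(votes, key=votes.get)


# Toy data: three majority points (class 1) far from the query, one minority point (class 2) close to it.
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
train_y = [1, 1, 1, 2]
prediction = weighted_knn_predict(train_X, train_y, query=(0.9, 0.9), k=4)
print(prediction)
```

With k=4 an unweighted majority vote would pick class 1 here (three votes against one), while the weighted vote favours the nearby class 2 point.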

As for linear regression methods, you may try to regularize your weights to avoid potential overfitting. Again, you could try something like one-vs-all if it applies to the algorithm you are using.

+1 @juampa Thank you for your help! =) I'm using k-nearest neighbor method and regularized linear regression methods. Possibly also trying multilayer perceptron. — jjepsuomi, Jun 25 '14 at 10:46

score 4 · Answer 2 · answered Aug 20 '17 at 08:42

There are several options:

- Collect more data to try to balance your dataset (most of the time this is not possible; consider churn analysis, for example).
- Downsample the majority class.
- Upsample the minority class by adding more copies of the positive examples, or by generating synthetic ones as explained in the SMOTE paper.
- Do some kind of resampling that draws examples from both the positive and the negative set. This is useful if you are running algorithms based on SGD.
- A newer idea coming from deep learning is to use an adversarial network to generate new examples for the positive class by training a generator and a discriminator. The generator takes noise as input and generates new examples that try to fool the discriminator. The discriminator, in turn, tries not to be fooled by the generator.
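As a rough illustration of the downsampling and upsampling options above, here is a pure-Python sketch that resamples every class to the same size. The helper name `resample_to` and the target of 1233 per class are assumptions for the example; the labels reuse the class counts from the question.

```python
import random
from collections import Counter, defaultdict


def resample_to(labels, target_per_class, seed=0):
    """Return indices with exactly target_per_class examples of each class.

    Classes larger than the target are downsampled without replacement;
    smaller classes are upsampled by drawing extra copies with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    out = []
    for idx in by_class.values():
        if len(idx) >= target_per_class:
            # Downsample: keep a random subset of the class.
            out.extend(rng.sample(idx, target_per_class))
        else:
            # Upsample: keep every example and add random duplicates.
            extra = rng.choices(idx, k=target_per_class - len(idx))
            out.extend(idx + extra)
    rng.shuffle(out)
    return out


labels = [1] * 9973 + [2] * 589 + [3] * 1233
balanced_idx = resample_to(labels, target_per_class=1233)
balanced_counts = Counter(labels[i] for i in balanced_idx)
print(balanced_counts)
```

Resampling like this is normally applied to the training part only, so that the evaluation data still reflects the real class frequencies.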

Hi Fabio, and thank you for your help! Appreciate it very much :) — jjepsuomi, Aug 25 '17 at 08:54

score 3 · Answer 3 · answered Jun 25 '14 at 11:33

In addition to @juampa's answer, there are methods that create synthetic samples based on real samples to make your dataset more balanced. This is in fact a very interesting topic; a really good summary paper about these techniques can be found here: http://li231-127.members.linode.com/sites/default/files/issues/6-1-2004-06/batista.pdf

One of these algorithms is SMOTE; a Python implementation can be found here: http://comments.gmane.org/gmane.comp.python.scikit-learn/5278

Another is ADASYN. A Python implementation can be found here: https://github.com/ojtwist/ADASYN
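To give a feel for how synthetic samples are created, here is a simplified SMOTE-style sketch in pure Python: pick a minority point, pick one of its nearest minority neighbours, and place a new point at a random position on the segment between them. This is not the code behind either link above; `smote_like`, the value of `k` and the toy points are assumptions for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding the base itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features (illustrative values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every new point lies between two real minority points, the synthetic samples stay inside the region the minority class already occupies instead of being exact duplicates.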

score 0 · Answer 4 · answered Jun 22 '18 at 18:18

The following link also provides a good analysis of how to deal with unbalanced classes: Class imbalance in Supervised Machine Learning

CathyQian
