SVM for unbalanced data

Question

I want to try Support Vector Machines (SVMs) on my dataset. Before I start, though, I was warned that SVMs don't perform well on extremely unbalanced data. In my case, the data can be as much as 95-98% 0's and 2-5% 1's.

I tried to find resources that discuss using SVMs on sparse/unbalanced data, but all I could find was 'sparseSVMs' (which use a small number of support vectors).

I was hoping someone could briefly explain:

How well SVM would be expected to do with such a dataset
Which, if any, modifications must be done to the SVM algorithm
What resources/papers discuss this

score 18 · Accepted Answer · answered Apr 18 '14 at 20:57


Many SVM implementations address this by assigning different weights to positive and negative instances. Essentially, you weight the samples so that the sum of the weights for the positives equals the sum for the negatives. Of course, when evaluating the SVM you have to remember that if 95% of the data is negative, it is trivial to get 95% accuracy by always predicting negative. So you have to make sure your evaluation metrics are also weighted so that they are balanced across the classes.

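Specifically in libsvm, which you added as a tag, there is a flag that lets you set the class weights: `-wi`, where `i` is the class label (for example `-w1 10`), which multiplies the penalty parameter C for that class. Check the docs for your version.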
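As a minimal sketch of the weighting described above, the snippet below picks per-class weights so that each class carries the same total weight. The helper name `balanced_class_weights` and the 95/5 toy labels are assumptions made for illustration, and the option string at the end only shows how such weights might be passed to libsvm's `svm-train`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} such that every class has the same total weight.

    Hypothetical helper: weight = n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mimicking the question: 95% zeros, 5% ones.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# Total weight per class (weight times number of samples); should match.
totals = {cls: weights[cls] * labels.count(cls) for cls in weights}

# One way these weights could be written as libsvm per-class options.
libsvm_options = " ".join(f"-w{cls} {w:g}" for cls, w in sorted(weights.items()))

print(weights)
print(totals)
print(libsvm_options)
```

Only the ratio between the class weights matters for the balance; the overall scale interacts with the penalty parameter C, so C would still need tuning.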

Finally, from personal experience I can tell you that I often find that an SVM will yield very similar results with or without the weight correction.

answered Apr 18 '14 at 20:57

Bitwise


Beat me to it :-) – Marc Claesen Apr 18 '14 at 20:58
@Bitwise I have the same problem of imbalanced data and I get an accuracy of 99%. I used the weights in libsvm. You mentioned that the evaluation metrics must also be weighted. How can we weight the evaluation metrics? – Hani Goc May 17 '16 at 13:47

@HaniGoc basically you want to separately calculate the accuracy for each class, and take the average of that. So for example, if you have 10 class A and 90 class B and you guessed all samples to be class B, in standard accuracy you would have $90/100 = 0.9$, but in the weighted accuracy you would have $0.5 \times (0/10 + 90/90) = 0.5$. – Bitwise May 17 '16 at 19:14
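The per-class averaging in the comment above can be sketched in a few lines of plain Python. The function name `balanced_accuracy` is chosen here for illustration; the data reproduces the 10 class A / 90 class B example with every sample predicted as B.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class accuracies (each class counts equally)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[cls] / total[cls] for cls in total) / len(total)


# 10 samples of class A, 90 of class B, and a classifier that always says B.
y_true = ["A"] * 10 + ["B"] * 90
y_pred = ["B"] * 100

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)

print(plain)     # standard accuracy
print(balanced)  # per-class average
```

With these inputs the standard accuracy is 0.9 while the per-class average is 0.5, matching the arithmetic in the comment.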

score 8 · Answer 2 · answered Apr 18 '14 at 20:58


SVMs work fine on sparse and unbalanced data. Class-weighted SVM is designed to deal with unbalanced data by assigning higher misclassification penalties to training instances of the minority class.

answered Apr 18 '14 at 20:58

Marc Claesen


score 5 · Answer 3 · answered Apr 25 '14 at 08:21


In the case of sparse data like that, SVM will work well.

As stated by @Bitwise, you should not use accuracy to measure the performance of the algorithm.

Instead, you should calculate the precision, recall and F-score of the algorithm.
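A minimal sketch of those three metrics, assuming binary labels with 1 as the rare positive class. The function name `precision_recall_f1` and the toy predictions are made up for illustration and are not results from any real model.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy test set: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95
# Toy predictions: 3 of the 5 positives found, plus 2 false alarms.
y_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.96 even though only 3 of the 5 positives are found, which is why precision, recall and the F-score (all 0.6 in this toy case) are more informative on unbalanced data.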

answered Apr 25 '14 at 08:21

alexandrekow


Could you please expand on your reasoning? Also, how would you go about measuring the F-score once the classification (on the test set) has completed? Thanks – Spacey Nov 25 '14 at 00:44
To measure the F-score on the test set you will need to label it manually, and then compute recall and precision by comparing the manual labels with the predicted labels. What would you like me to expand on: why SVM works well with sparse data? – alexandrekow Nov 25 '14 at 09:15
Yes, why SVM works on sparse data would be nice as well. Thanks – Spacey Nov 25 '14 at 21:49
"Simply having sparse features does not present any problem for the SVM. One way to see this is that you could do a random rotation of the co-ordinate axes, which would leave the problem unchanged and give the same solution, but would make the data completely non-sparse (this is in part how random projections work)" (http://stats.stackexchange.com/questions/23470/does-a-sparse-training-set-adversely-affect-an-svm) – alexandrekow Nov 26 '14 at 13:18

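The rotation argument in the quoted comment can be illustrated with a tiny sketch. It assumes 2-D points and a fixed angle rather than a truly random rotation, and the helper names `rotate` and `dot` are hypothetical. A rotation preserves dot products, so a kernel that depends only on dot products or distances (linear or RBF, for example) sees the same problem even though the rotated vectors are no longer sparse.

```python
import math
from itertools import combinations


def rotate(v, theta):
    """Rotate a 2-D vector by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Sparse toy vectors: each has a zero coordinate.
sparse = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
theta = 0.7  # arbitrary angle, chosen only for the illustration
dense = [rotate(v, theta) for v in sparse]

# After rotation, no coordinate is zero any more.
print(dense)

# Pairwise dot products are unchanged (up to floating-point rounding).
for i, j in combinations(range(len(sparse)), 2):
    print(dot(sparse[i], sparse[j]), dot(dense[i], dense[j]))
```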