
I’m currently running into problems analyzing a tweet dataset with support vector machines. The problem is that I have an imbalanced binary-class training set (5:2), which is expected to be proportional to the real class distribution. When predicting, I get a low precision (0.47) for the minority class on the validation set; recall is 0.88. I tried several oversampling and under-sampling methods (performed on the training set), but they did not improve the precision, since the validation set is imbalanced as well to reflect the real class distribution. I also implemented different costs in the support vector machine, which helped. Now it seems that I cannot improve my performance any further.

Does anyone have advice on what I could do to improve my precision without hurting my recall? Furthermore, does anyone have a clue why I’m getting way more false positives than false negatives (positive is the minority class)?

Satwik Bhattamishra
Filippo Scopel
  • At least part of the problem is evaluating the model on the basis of an improper scoring rule. – Sycorax Mar 22 '16 at 19:28
  • By "oversampling and under-sampling methods", have you tried SMOTE (Synthetic Minority Over-sampling Technique)? From my experience , it improved my classification rate of the minority class for a 300:1 imbalanced dataset. – Matthew Lau Mar 22 '16 at 20:43
  • Hi Matthew, thanks for your reply. I tried multiple oversampling, undersampling, and even ensembling methods, including all kinds of SMOTE techniques. – Filippo Scopel Mar 23 '16 at 10:46
  • This suggests to me that the SVM is having a hard time estimating the correct decision boundary. It may well be that the two classes are "mixed up" together in the space of the inputs. Or maybe one class "wraps around" the other. What kernel are you using? – shadowtalker Mar 23 '16 at 13:25
  • Thank you! I think you could be right. I found that the two cost parameters I used mainly influence precision and recall. Usually one would set a lower cost for the majority class (as the 'balanced' parameter does in scikit-learn). However, in my case I get lower precision and higher recall when I set a lower cost for the majority class. In contrast, when I use a lower cost for the minority class I get even worse results. Therefore, it could be that the decision boundary is not linear. I will keep you updated. – Filippo Scopel Mar 23 '16 at 15:05
  • Hi, it is pretty much impossible for me to use another kernel function, since I used LinearSVC as the classifier, which is way faster than the SVC module with non-linear kernels. According to the scikit-learn documentation for SVC: "The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples." I have approx. 996,000 training tweets, which slows down the computation enormously. – Filippo Scopel Mar 23 '16 at 16:03
  • Since you are using scikit, try gradient boosted trees on your data. You'll probably get better precision-recall AUC right out of the box. SVCs, as you point out, are not really practical for anything but very small datasets. – rinspy Jun 08 '17 at 07:47
  • So what is your classification threshold? 0.5? Have you tried using the observed proportion of the positive class as a threshold instead? – g3o2 Aug 26 '17 at 18:52
  • Hi Filippo! I'm currently dealing with, I would say, exactly the same issue as you describe :-) I tried all the usual stuff (oversampling/undersampling, SMOTE, class weights) and even tried several different learners (SVM, random forest, fully connected neural networks), but the effect is the same everywhere: high recall of the minority class after applying SMOTE or class weights, but very low precision. Did you find a solution in the end? – Ursin Brunner Nov 05 '17 at 08:46

3 Answers


does anyone have a clue why I’m getting way more false positives than false negatives (positive is the minority class)? Thanks in advance for your help!

Because positive is the minority class. There are a lot of negative examples that could become false positives. Conversely, there are fewer positive examples that could become false negatives.

Recall that Recall = Sensitivity $=\dfrac{TP}{(TP+FN)}$

Sensitivity (the true positive rate) is related to the false positive rate (1 − specificity), as visualized by an ROC curve. At one extreme, you call every example positive and have 100% sensitivity with a 100% FPR. At the other, you call no example positive and have 0% sensitivity with a 0% FPR. When the positive class is the minority, even a relatively small FPR (which you may have because you have a high recall = sensitivity = TPR) will end up causing a high number of FPs (because there are so many negative examples).

Since

Precision $=\dfrac{TP}{(TP+FP)}$

even at a relatively low FPR, the FPs will overwhelm the TPs if the number of negative examples is much larger.
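To put rough numbers on this (an illustrative back-of-the-envelope using the figures in the question, not an actual confusion matrix): with the stated 5:2 ratio, imagine a validation set of 5,000 negatives and 2,000 positives. A recall of 0.88 means $TP = 0.88 \times 2000 = 1760$, and a precision of 0.47 means $TP + FP = 1760 / 0.47 \approx 3745$, so $FP \approx 1985$ and the FPR is only about $1985 / 5000 \approx 0.40$. Even that seemingly moderate FPR already yields more false positives than true positives, simply because negatives outnumber positives 2.5 to 1.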

Alternatively,

Positive classifier: $C^+$

Positive example: $O^+$

Precision = $P(O^+|C^+)=\dfrac{P(C^+|O^+)P(O^+)}{P(C^+)}$

$P(O^+)$ is low when the positive class is rare, which pulls the precision down.

Does anyone have advice on what I could do to improve my precision without hurting my recall?

As mentioned by @rinspy, gradient boosted trees (GBC) work well in my experience. They will, however, be slower than SVC with a linear kernel, but you can make very shallow trees to speed them up. Also, more features or more observations might help (for example, there might be some currently un-analyzed feature that is always set to some value in all of your current FPs).
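For illustration only, here is a minimal sketch of what that could look like in scikit-learn; the names X_train, y_train, X_val, y_val are placeholders for your own tweet features and labels, not something taken from the question:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve, auc

# Very shallow trees (max_depth=2) keep training reasonably fast on large data.
gbc = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1)
gbc.fit(X_train, y_train)

# Work with predicted probabilities so precision can be traded against recall
# by moving the decision threshold rather than accepting the default of 0.5.
proba = gbc.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)
print("PR AUC:", auc(recall, precision))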

It might also be worth plotting ROC curves and calibration curves. It might be the case that even though the classifier has low precision, it could lead to a very useful probability estimate. For example, just knowing that a hard drive might have a 500 fold increased probability of failing, even though the absolute probability is fairly small, might be important information.
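A rough sketch of how one might inspect both curves with scikit-learn, again assuming placeholder validation labels y_val and predicted probabilities proba from some classifier's predict_proba:

from sklearn.metrics import roc_curve
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_val, proba)                               # ROC curve points
prob_true, prob_pred = calibration_curve(y_val, proba, n_bins=10)   # reliability diagram

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(fpr, tpr)
ax1.set_xlabel("false positive rate")
ax1.set_ylabel("true positive rate (recall)")
ax2.plot(prob_pred, prob_true, marker="o")
ax2.plot([0, 1], [0, 1], linestyle="--")   # a perfectly calibrated classifier
ax2.set_xlabel("mean predicted probability")
ax2.set_ylabel("observed fraction of positives")
plt.show()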

Also, a low precision essentially means that the classifier returns a lot of false positives. This, however, might not be so bad if a false positive is cheap.

Ethen
sjw

Methods to try out:

Undersampling:

I suggest using undersampling techniques and then training your classifier.

Imbalanced-learn provides a scikit-learn-style API for imbalanced datasets and should be a good starting point for sampling methods and algorithms to try out.

Library: https://imbalanced-learn.readthedocs.io/en/stable/
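As an illustration only (the variable names below are placeholders, not from the question), random undersampling of the majority class with imbalanced-learn might look like this:

from imblearn.under_sampling import RandomUnderSampler
from sklearn.svm import LinearSVC

# Downsample the majority class so both classes end up the same size.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

clf = LinearSVC(C=1.0)
clf.fit(X_res, y_res)
print(clf.score(X_val, y_val))   # evaluate on the untouched, imbalanced validation set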

Rank-Based SVM:

This has been shown to improve recall in high-precision systems and is used by Google for detecting bad advertisements. I recommend trying it out.

Reference paper for SVM:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37195.pdf

Vibhu Jawa

The standard approach would be to weight your error based on class frequency. For example, if you were doing it in Python with sklearn:

import sklearn.svm
# class_weight='balanced' adjusts C inversely proportional to class frequencies
model = sklearn.svm.SVC(C=1.0, kernel='linear', class_weight='balanced')
model.fit(X, y)
mprat