
For a training data set for an SVM learning algorithm, with 60 positive-class samples and 40 negative-class samples, are the following two oversampling methods equivalent?

(1) bootstrapping the 40 negative samples up to 60.

(2) bootstrapping both classes up to 500 samples each.

This question seems similar to existing questions, but it is not. I am aware that I could use undersampling, oversampling, SMOTE, or cost-sensitive learning. Specifically, though, for the SVM algorithm, given that both options above are oversampling, is there any difference between the two methods, and which one is more reasonable?
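
For concreteness, the two schemes would look something like this (a minimal sketch assuming scikit-learn; the Gaussian toy data and the linear kernel are illustrative stand-ins for the real 60/40 dataset):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_pos = rng.normal(loc=1.0, size=(60, 2))   # stand-in for the 60 positives
X_neg = rng.normal(loc=-1.0, size=(40, 2))  # stand-in for the 40 negatives

# Scheme (1): bootstrap the 40 negatives up to 60, leave the positives alone.
X_neg_60 = resample(X_neg, replace=True, n_samples=60, random_state=0)
X1 = np.vstack([X_pos, X_neg_60])
y1 = np.r_[np.ones(60), -np.ones(60)]

# Scheme (2): bootstrap both classes up to 500 samples each.
X_pos_500 = resample(X_pos, replace=True, n_samples=500, random_state=0)
X_neg_500 = resample(X_neg, replace=True, n_samples=500, random_state=1)
X2 = np.vstack([X_pos_500, X_neg_500])
y2 = np.r_[np.ones(500), -np.ones(500)]

svm1 = SVC(kernel="linear").fit(X1, y1)
svm2 = SVC(kernel="linear").fit(X2, y2)
```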

Yan
  • 60% vs 40% seems perfectly fine without any treatment. – Haitao Du Aug 02 '17 at 20:20
  • @hxd1011 Thanks for the comment! But when I do cross-validation on the training set and check the predictions of each test fold, it seems the best cutoff is some positive value instead of the default zero (assuming the positive class is 1 and the negative class is -1); see the sketch after these comments. That's why I tried to balance the classes in the first place. – Yan Aug 03 '17 at 20:57
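
For reference, the cutoff check described in the comment above might look like this (a sketch assuming scikit-learn; the toy data and the use of Youden's J to pick the cutoff are illustrative assumptions, not something stated in the question):

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(1.0, size=(60, 2)),    # 60 "positives"
               rng.normal(-1.0, size=(40, 2))])  # 40 "negatives"
y = np.r_[np.ones(60), -np.ones(40)]

# Out-of-fold decision values, so the cutoff is not estimated in-sample.
scores = cross_val_predict(SVC(kernel="linear"), X, y, cv=5,
                           method="decision_function")

# Pick the cutoff maximizing Youden's J (tpr - fpr); with imbalanced
# classes it often lands away from the default cutoff of 0.
fpr, tpr, thresholds = roc_curve(y, scores)
print("best cutoff:", thresholds[np.argmax(tpr - fpr)])
```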

1 Answer


Given that you have a "fixed" method you want to use, you can try both approaches, repeating each one many times. Afterwards, you can assess under which of the two schemes your model performs better.

Don't forget to share your results! I'm interested to see them.
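
A sketch of that experiment, assuming scikit-learn (the toy data, linear kernel, AUC metric, and repetition counts are placeholder choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

def scheme1(Xtr, ytr, seed):
    """(1) Keep the positives; bootstrap the negatives up to the same count."""
    Xp, Xn = Xtr[ytr == 1], Xtr[ytr == -1]
    Xn = resample(Xn, replace=True, n_samples=len(Xp), random_state=seed)
    return np.vstack([Xp, Xn]), np.r_[np.ones(len(Xp)), -np.ones(len(Xn))]

def scheme2(Xtr, ytr, seed, n=500):
    """(2) Bootstrap both classes up to n samples each."""
    Xp = resample(Xtr[ytr == 1], replace=True, n_samples=n, random_state=seed)
    Xn = resample(Xtr[ytr == -1], replace=True, n_samples=n, random_state=seed + 1)
    return np.vstack([Xp, Xn]), np.r_[np.ones(n), -np.ones(n)]

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(1.0, size=(60, 2)), rng.normal(-1.0, size=(40, 2))])
y = np.r_[np.ones(60), -np.ones(40)]

for name, scheme in [("(1) 40 -> 60", scheme1), ("(2) 500 each", scheme2)]:
    aucs = []
    for rep in range(20):  # repeat the bootstrap many times
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
        for train, test in cv.split(X, y):
            # Oversample the training fold only, so bootstrap duplicates
            # never leak into the held-out fold.
            Xb, yb = scheme(X[train], y[train], seed=rep)
            clf = SVC(kernel="linear").fit(Xb, yb)
            aucs.append(roc_auc_score(y[test], clf.decision_function(X[test])))
    print(f"{name}: mean AUC {np.mean(aucs):.3f} (sd {np.std(aucs):.3f})")
```

Oversampling inside each training fold matters here: cross-validating on data that already contains bootstrap duplicates of the test points gives optimistically biased scores.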

Vasilis Vasileiou
  • Actually, I've done that, and the results seem really similar to each other. That's why I posted the question and wanted a theoretical explanation. – Yan Aug 03 '17 at 20:55
  • Ok. How are you validating your model's performance? AUC? A specific loss function? – Vasilis Vasileiou Aug 03 '17 at 23:31
  • I did leave-one-out cross-validation on this data, so every sample gets one prediction value from a model built on the remaining samples. Within every model I used both oversampling methods, so I got two prediction values for every sample (one per method), which I plotted as points in a graph. The mean prediction value of each class is nearly the same under the two oversampling methods, and by eye the points also seemed very similarly distributed. I didn't use any specific metric for evaluation because the results come out slightly different each time you redo the oversampling. – Yan Aug 03 '17 at 23:42
  • The more you increase the size, the more you "force" the differences to be significant and the more you, roughly speaking, overfit to your dataset. That being said, I would prefer the 40-to-60 case because I don't see any reason against it. Why would you "fix by force" your separating line by feeding it more of the same data points? E.g., suppose that by some miracle you get one additional point for the minority class and you refit your model: the 40-to-60 model will adjust better, whereas the 500-sample model will stay nearly untouched. From a predictive perspective, the error margin of the 40-to-60 model will be more reliable. – Vasilis Vasileiou Aug 04 '17 at 00:11
  • The reason for balancing the classes is to get a prediction cutoff near zero, assuming the positive class is 1 and the negative class is -1, because I saw that the predictions for all samples lean toward positive values with the 40-60 sample proportion. But after balancing the classes in either way, the prediction values did not change much. Do you also have an explanation for that? (See the sketch after this thread.) I guess I should be using the 40-60 then. Thanks! – Yan Aug 04 '17 at 17:37
  • I understand that. I just don't see any reason why you would choose the second option. – Vasilis Vasileiou Aug 04 '17 at 18:42
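
As a postscript on the "explanation" Yan asks for above: for a soft-margin SVM, repeating a training point k times is equivalent to multiplying that point's penalty C by k. Scheme (1) therefore acts like cost-sensitive learning with roughly a 1.5:1 weight on the negative class, and scheme (2) imposes the same 1.5:1 relative weighting, just with a larger overall effective C, to which SVM solutions are often not very sensitive. That would explain why both schemes move the predictions in almost the same way. A sketch of the exact-duplication version of this equivalence, assuming scikit-learn (toy data again):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_pos = rng.normal(1.0, size=(60, 2))
X_neg = rng.normal(-1.0, size=(40, 2))

# (a) Duplicate every negative point, i.e. oversample by exact copying.
X_dup = np.vstack([X_pos, X_neg, X_neg])
y_dup = np.r_[np.ones(60), -np.ones(80)]
svm_dup = SVC(kernel="linear", C=1.0).fit(X_dup, y_dup)

# (b) Equivalently, double the penalty C for the negative class instead.
X_orig = np.vstack([X_pos, X_neg])
y_orig = np.r_[np.ones(60), -np.ones(40)]
svm_cw = SVC(kernel="linear", C=1.0, class_weight={-1: 2.0}).fit(X_orig, y_orig)

print(svm_dup.coef_, svm_dup.intercept_)  # the two fitted hyperplanes
print(svm_cw.coef_, svm_cw.intercept_)    # should (nearly) coincide
```

Bootstrapping rather than exact duplication adds sampling noise on top of this, which is consistent with Yan's observation that the results shift slightly every time the oversampling is redone.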