Let's say I have a set P of positive examples and a set N of negative examples.
Prior to feeding this dataset to an SVM for training, should I remove duplicates within each set?
Intuitively, I don't think that showing the same example several times adds much information, though I understand it could change the weight attached to that example.
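For concreteness, here is how I would deduplicate each set, assuming the examples are already rows of NumPy feature matrices (the arrays and values below are made up for illustration):

```python
import numpy as np

# Hypothetical feature matrices for the positive and negative sets.
P = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])  # contains one duplicate row
N = np.array([[5.0, 6.0], [5.0, 6.0]])               # two identical rows

# np.unique with axis=0 collapses duplicate rows within a set.
P_unique = np.unique(P, axis=0)
N_unique = np.unique(N, axis=0)

print(P_unique.shape)  # (2, 2): the duplicate positive example is gone
print(N_unique.shape)  # (1, 2)
```

Note that this discards the multiplicity information entirely; an alternative would be to keep one copy per distinct row and pass the counts as per-sample weights to the trainer, if it supports that.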
A slightly related question: how should I handle examples which are present in both sets?
Because, once objects from the real world are projected into the feature space, some information is lost, so a p from P and an n from N might end up with the same coordinates in feature space (but different labels).
I am thinking about removing such examples from the P set. But maybe they should be removed from both sets. And I could also understand that someone with a small P set would prefer to remove them only from the N set (so as not to lose any precious positive example).
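To make the conflicting-examples case concrete, here is a sketch of how I could detect feature vectors present in both sets and drop them from both (again assuming NumPy row matrices; the data is invented):

```python
import numpy as np

P = np.array([[1.0, 2.0], [3.0, 4.0]])
N = np.array([[3.0, 4.0], [5.0, 6.0]])  # [3.0, 4.0] appears in both sets

# View rows as tuples so they can be compared as set elements.
p_rows = {tuple(row) for row in P}
n_rows = {tuple(row) for row in N}
conflicts = p_rows & n_rows  # feature vectors with contradictory labels

# One option: drop the conflicting rows from both sets.
P_clean = np.array([row for row in P if tuple(row) not in conflicts])
N_clean = np.array([row for row in N if tuple(row) not in conflicts])

print(conflicts)  # {(3.0, 4.0)}
```

Removing them only from N (or only from P) would just mean filtering a single array instead of both.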
Are SVMs smart and robust enough to handle such cases automatically?