I have used five different algorithms (bagging, boosting, C4.5, random forests, and SVM) for binary classification of biological data relating to peptide sequences. The dataset comprised approximately 1,000 samples (evenly distributed between two groups, negative and positive) and 126 attributes. The performance of each classifier was assessed in terms of ACC, SEN, SPC, and MCC, and SVM surpassed the others on all four metrics. Bagging had the second-best performance. What makes SVM more appropriate for binary classification, especially when dealing with biological data?
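For reference, and assuming the abbreviations carry their usual meanings (accuracy, sensitivity, specificity, and the Matthews correlation coefficient), these scores are computed from the confusion-matrix counts $TP$, $TN$, $FP$, $FN$:

$$\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{SEN}=\frac{TP}{TP+FN},\qquad \mathrm{SPC}=\frac{TN}{TN+FP},$$

$$\mathrm{MCC}=\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$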
- Could you provide a reference making that claim? This answer provides some hints (http://stats.stackexchange.com/questions/35276/svm-overfitting-curse-of-dimensionality). I am not familiar with biological data, but I guess it has high dimensionality, and maybe you do not have that many samples. SVMs behave well in that setting. – jpmuc Jun 27 '15 at 14:00
- @juampa I am not making a claim; these are the results I attained from classifying with five different algorithms (SVM, RF, C4.5, AdaBoost, and bagging), and SVM had the highest overall performance (higher ACC, SEN, SPC, and MCC). I want to know why that is. – Maryam Jun 27 '15 at 14:10
- Sorry if I seemed rough. I understood it as if you meant all kinds of biological data. As for your question, what is your data like? Does it fit the description in my first comment? That is, how many features? How many samples? Are there many correlated features? – jpmuc Jun 27 '15 at 14:16
- @juampa No need for an apology. I major in biotechnology, so I'm not very familiar with machine learning. I had about 1,000 samples with 126 attributes, which are all independent. I also attained good results with the bagging model, but SVM was better. – Maryam Jun 27 '15 at 14:25
- Maybe some features are not helpful. In this other answer (http://stats.stackexchange.com/questions/158759/bad-results-using-bayes-multinomial-navie-in-multi-label-classification-texts/158777#158777) I point out some issues related to highly correlated variables. Maybe that applies to your problem? I would try to understand why bagging helps so much, that is, what the reason for that instability is. It should be possible to guess that from your data. – jpmuc Jun 27 '15 at 17:07
- I think this is potentially a good question. To clarify some of the things I think need further explication and support: the claim is extraordinarily broad ("biological data" covers a host of things) and is based on what appears to be very thin, anecdotal evidence of only a few cases. As such, either the question needs to be narrowed to the cases actually in evidence, or, as already outlined in the comments, the broader claim needs support; otherwise "why would anyone think this is generally the case?" seems to be the obvious response. ... (ctd) – Glen_b Jun 29 '15 at 03:59
- (ctd) ... (I was somewhat torn between choosing "too broad" and "unclear".) Some of the other voters may like to add their own reasons for feeling that the question should be put on hold. – Glen_b Jun 29 '15 at 03:59
- Nevertheless, for the moment I've reopened the question, but additional clarification/improvement is needed to make it a suitable question. – Glen_b Jun 29 '15 at 04:08
1 Answer
Usually one would expect a method to perform better if its assumptions are met, and, if that holds for several methods, the one that makes the most stringent assumptions to perform best. The problem with many machine learning algorithms is that they do not come with explicit assumptions: for example, LDA explicitly assumes Gaussian classes, whereas an SVM only implicitly assumes that the classes are separable with a large margin in the kernel-induced feature space. Because of this, which method performs best on given data is almost entirely an empirical question. In other words, there is no real theoretical answer to your question.
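Since the question is empirical, the practical answer is to compare the candidates on your own data with cross-validation and the score you actually care about. Below is a minimal sketch in Python with scikit-learn, not the poster's actual setup: the dataset is a synthetic stand-in for the peptide data (`make_classification`), the hyperparameters are illustrative defaults, and since scikit-learn has no C4.5 implementation, a CART decision tree stands in for it.

```python
# Minimal sketch: compare several classifiers on one dataset with
# stratified 10-fold cross-validation, scored by MCC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the peptide data: ~1000 samples, 126 features,
# balanced classes. Swap in your own feature matrix X and labels y.
X, y = make_classification(n_samples=1000, n_features=126, n_informative=30,
                           weights=[0.5, 0.5], random_state=0)

classifiers = {
    # scikit-learn's tree is CART, used here as a stand-in for C4.5
    "C4.5-like tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    # SVMs are scale-sensitive, so standardize features first
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

mcc = make_scorer(matthews_corrcoef)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, scoring=mcc, cv=cv)
    print(f"{name:15s} MCC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Looking at the fold-wise spread rather than a single point estimate is what justifies a claim like "SVM surpassed the others": with only about 1,000 samples, differences of a few hundredths of MCC are often within noise.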

A. Donda