Handle large set of features using SVM

Question

I have a biological dataset with 30,000 features (genes) and 1,000 data points (cells). The cells fall into two major classes, 1 and 0, with a 90/10 distribution.

Now I am trying to classify these correctly using nested cross-validation. The first thing I tried was to manually reduce the number of features by considering biologically relevant subsets of the total feature set (reduced to 20 features), which gives me reasonable results (an F2 score of 0.7).

However, I am wondering whether using the whole feature set would lead to severe overfitting, since I have far fewer data points than features.

Is it true that I would overfit my data if I use the whole feature set? And if so, are there any ways to decrease the feature set without prior biological knowledge?

Thanks a lot! Tomi

Related: http://stats.stackexchange.com/questions/35276/svm-overfitting-curse-of-dimensionality — Danica, Aug 25 '14 at 08:52

Marc Claesen · Answer 1 · 2014-08-25T09:00:38.137


SVM does not overfit when using a lot of features, provided that you regularize correctly.

Due to the kernel trick, SVM operates on inner products in feature space (I'm going to assume you are using a linear kernel). SVM does not estimate coefficients per feature, as is done in linear regression, but instead estimates coefficients per training instance. Hence, SVMs are less affected by the number of features and the curse of dimensionality.
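For reference, the per-instance and per-feature views can be written side by side. This is the standard soft-margin formulation for a linear kernel, assuming labels recoded as $y_i \in \{-1, +1\}$ (the question uses 0/1), $n$ training instances $x_i$, and slack variables $\xi_i$; none of these symbols appear in the original post.

$$
\begin{aligned}
\text{primal:}\quad & \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
&& \text{s.t. } y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \\
\text{dual:}\quad & \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
&& \text{s.t. } 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0 \\
\text{link:}\quad & w = \sum_{i=1}^{n}\alpha_i y_i x_i,
\qquad f(x) = \sum_{i=1}^{n}\alpha_i y_i\,x_i^\top x + b = w^\top x + b
\end{aligned}
$$

With the numbers from the question, the dual has $n = 1000$ coefficients $\alpha_i$, while $w$ has 30,000 entries, yet $w$ is confined to the span of the training instances.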

That said, if you have more features you will likely need to regularize more strongly (e.g. by using a lower $C$), since the training errors typically increase in size, which can induce overfitting.
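As a rough illustration of tuning $C$ inside nested cross-validation, here is a minimal sketch using scikit-learn. It rests on several assumptions that are not from the question: a small synthetic dataset stands in for the gene expression matrix, the grid of $C$ values is arbitrary, features are standardized, and `class_weight="balanced"` is one possible way to handle the 90/10 imbalance. The helper name `nested_cv_f2` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def nested_cv_f2(X, y, c_grid, seed=0):
    """Return outer-fold F2 scores for a linear SVM whose C is tuned in an inner loop."""
    # F2 weights recall more heavily than precision; zero_division=0 avoids
    # warnings when a fold predicts no positives at all.
    f2 = make_scorer(fbeta_score, beta=2, zero_division=0)

    # Scaling lives inside the pipeline so it is refit on each training fold only.
    model = make_pipeline(
        StandardScaler(),
        LinearSVC(class_weight="balanced", dual=True, max_iter=20000),
    )

    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)

    # Inner loop: choose C. Outer loop: estimate performance of the whole procedure.
    search = GridSearchCV(model, {"linearsvc__C": c_grid}, scoring=f2, cv=inner)
    return cross_val_score(search, X, y, scoring=f2, cv=outer)


# Synthetic stand-in (assumption): more features than samples, 90/10 class split.
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=10,
    weights=[0.9, 0.1], random_state=0,
)
scores = nested_cv_f2(X, y, c_grid=[1e-4, 1e-3, 1e-2, 1e-1, 1.0])
print("outer-fold F2 scores:", scores)
print("mean F2:", scores.mean())
```

Comparing the outer-fold scores on the full feature set with those on the 20-feature subset is one way to check empirically whether the extra features hurt. Any preprocessing or feature selection should stay inside the pipeline so that it never sees the held-out fold.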

edited Aug 25 '14 at 09:00

answered Aug 25 '14 at 07:03

Marc Claesen


Your second paragraph isn't really true. The primal SVM problem with a linear kernel does indeed compute a weight for each feature, and even if you solve it in the dual, thanks to strong duality your solution is exactly the same as if you had computed a weight per dimension. SVMs are less susceptible to overfitting in high dimensions when carefully regularized and particularly with an appropriate kernel, but they're definitely not immune. – Danica Aug 25 '14 at 08:54
@Dougal It computes a weight per feature indirectly. Linear SVM *still* computes instance weights just like its kernelized siblings. Since a linear combination of linear inner products can be summarized as a single inner product, the solution of linear SVM *happens to* translate to feature weights. That said, you are correct that they are not immune to overfitting, so I modified the answer slightly to reflect this. – Marc Claesen Aug 25 '14 at 08:59
You can equally well argue that it happens to compute instance weights as a byproduct of feature weights. In the primal, you optimize over $w$ and $b$; in the dual, over $\alpha$. The two are equivalent, and which you solve depends on what software you used. (Typically primal solvers are preferred if available and there are more instances than feature dimensions, but that's a computational issue irrelevant to the actual solution.) – Danica Aug 25 '14 at 09:01
@Dougal Sure, the interpretations are entirely equivalent for the linear kernel, but only there. I prefer the dual perspective because it provides an intuitive understanding of what happens for all kernels. In case of the linear kernel, the primal interpretation is indeed equally interpretable as the dual but for other kernels the primal interpretation is far less obvious (it is certainly not equivalent to computing feature weights directly). – Marc Claesen Aug 25 '14 at 09:08
Yes, absolutely. My point was just that you're making a claim about the behavior of SVMs based on the dual formulation, but your reasoning doesn't apply to the (in the case you're talking about) equivalent primal formulation. – Danica Aug 25 '14 at 09:11

[This recent paper](http://papers.nips.cc/paper/5440-the-limits-of-squared-euclidean-distance-regularization.pdf) reminded me of this discussion. In some loose sense, it's evidence that the kernel-mode interpretation is more informative than the feature-weight one, in some settings. – Danica Dec 17 '14 at 17:02
