14

I would like to train an SVM to classify cases (TRUE/FALSE) based on 20 attributes. I know that some of those attributes are highly correlated. Therefore my question is: is SVM sensitive to the correlation, or redundancy, between the features? Any reference?

Danica
user7064
  • My guess would be no, since building a separation on one variable leaves the other, correlated variables with little additional separating power. There might be some instability in which variable gets chosen, however. – mandata May 04 '15 at 14:28
  • Are you talking about a linear SVM, or RBF kernel, or...? – Danica May 05 '15 at 05:26
  • Hmmmm, I don't know... does the answer depend on that? – user7064 May 05 '15 at 05:27
  • Yes, absolutely. You can design a kernel to explicitly deal with the correlations, if you'd like. – Danica May 05 '15 at 05:40
  • @Dougal: If there are methods to eliminate the effect of correlation, doesn't that imply that standard SVM is sensitive to correlation? – cfh May 05 '15 at 10:48

1 Answer

14

Linear kernel: The effect here is similar to that of multicollinearity in linear regression. Your learned model may not be particularly stable against small variations in the training set, because different weight vectors will have similar outputs. The training set predictions, though, will be fairly stable, and so will test predictions if they come from the same distribution.
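A small sketch of the effect, with made-up data and assuming scikit-learn is available: refit a linear SVM on bootstrap resamples of a dataset with two nearly identical features, and compare the weight vectors with the decision values on a fixed test set.

```python
# Illustration only: with two almost identical features, the individual weights
# are only weakly pinned down by the data, while the decision values on a fixed
# set of test points tend to move much less between resamples.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)          # almost a copy of x1
X = np.column_stack([x1, x2])
y = (x1 + rng.normal(scale=0.3, size=n) > 0).astype(int)

x_test = rng.normal(size=50)
X_test = np.column_stack([x_test, x_test])        # same correlation structure

for _ in range(3):
    idx = rng.choice(n, size=n)                   # bootstrap resample
    clf = SVC(kernel="linear", C=1.0).fit(X[idx], y[idx])
    print("w =", clf.coef_.ravel().round(3),
          " decision values:", clf.decision_function(X_test)[:3].round(3))
```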

RBF kernel: The RBF kernel only looks at distances between data points. Thus, imagine you actually have 11 attributes, but one of them is repeated 10 times (a pretty extreme case). Then that repeated attribute will contribute 10 times as much to the distance as any other attribute, and the learned model will probably be much more impacted by that feature.
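A quick numerical check of that claim (NumPy only, made-up vectors):

```python
# With 11 underlying attributes, repeating the first one 10 times (20 columns
# total) makes it contribute 10x as much to the squared Euclidean distance
# that the RBF kernel is based on.
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.normal(size=11), rng.normal(size=11)

d2_plain = np.sum((x - z) ** 2)                       # each attribute counted once

x_dup = np.concatenate([np.repeat(x[0], 10), x[1:]])  # first attribute repeated 10x
z_dup = np.concatenate([np.repeat(z[0], 10), z[1:]])
d2_dup = np.sum((x_dup - z_dup) ** 2)

print(d2_dup - d2_plain)           # extra contribution of the repeated attribute
print(9 * (x[0] - z[0]) ** 2)      # same number: 9 additional copies of that term
```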

One simple way to discount correlations with an RBF kernel is to use the Mahalanobis distance: $d(x, y) = \sqrt{ (x - y)^T S^{-1} (x - y) }$, where $S$ is an estimate of the covariance matrix. Equivalently, map all your vectors $x$ to $C x$ and then use the regular RBF kernel, where $C$ is any matrix with $S^{-1} = C^T C$, e.g. $C = L^T$ from the Cholesky decomposition $S^{-1} = L L^T$.
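A minimal sketch of that whitening step, assuming NumPy and scikit-learn (the helper name and the toy data are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def mahalanobis_whitener(X_train):
    """Return a map x -> C x with C^T C = S^{-1}, S the sample covariance.

    Assumes S is invertible and reasonably well conditioned; otherwise
    regularize it (e.g. add a small ridge) before inverting.
    """
    S = np.cov(X_train, rowvar=False)
    S_inv = np.linalg.inv(S)
    L = np.linalg.cholesky(S_inv)        # S^{-1} = L L^T
    C = L.T                              # so ||C x - C z||^2 = (x - z)^T S^{-1} (x - z)
    return lambda X: X @ C.T

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)   # two strongly correlated columns
y = (X[:, 0] > 0).astype(int)

whiten = mahalanobis_whitener(X)                 # fit the map on training data only
clf = SVC(kernel="rbf", gamma="scale").fit(whiten(X), y)
print(clf.score(whiten(X), y))                   # apply the same map to any test data
```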

Danica
  • This is a very interesting answer; I'd like to read more about how to mitigate these kinds of problems. Can you add a reference or two? – Sycorax May 05 '15 at 17:45
  • I don't know a good one off-hand, but I'll look around a bit for one, perhaps tonight. – Danica May 05 '15 at 17:45
  • Awesome! Inbox me if you happen to find a cool article. I'm glad that my (+1) could put you over 3k. (-: – Sycorax May 05 '15 at 17:47
  • The inverse of the covariance matrix in the Mahalanobis distance is key. If you can estimate it reliably, this effect can be accounted for. – Vladislavs Dovgalecs May 05 '15 at 19:00