Why doesn't a non-linear kernel improve accuracy in high dimensions compared to a linear kernel?

Question

I read somewhere that if the number of dimensions in your feature set is very high, then a non-linear kernel such as RBF (or any other) may not help in increasing accuracy compared to a linear kernel.

What is an intuitive reason for this?

The same post (I think it was one of the answers on CrossValidated) mentioned that this is a typical case when working with textual data, as the number of features is usually very high.

score 6 · Accepted Answer · answered Feb 18 '15 at 11:14

The motivation to use kernel functions is to map the data onto a (typically higher dimensional) feature space in which it is easier to separate the data linearly. If the input space is high dimensional, the data is typically already (nearly) separable, so there is no need to map to an even higher dimensional feature space.

In theory, the best possible model you can obtain with an RBF kernel is at least as good as the best possible linear model. In practice, the improvement offered by nonlinear kernels is often not worth the extra computational effort.
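The separability claim can be checked with a small experiment. This is a minimal sketch and not part of the original argument: it uses a plain perceptron instead of an SVM so that it needs only the standard library, and the helper name `perceptron_train_accuracy` and the sizes (20 points, 2 versus 200 dimensions) are assumptions chosen for illustration. The labels are random, so any perfect linear fit comes from the dimensionality alone and not from structure in the data.

```python
import random


def perceptron_train_accuracy(n_points, n_dims, max_epochs=1000, seed=0):
    """Fit a perceptron to Gaussian points with random labels.

    Returns the fraction of training points classified correctly.
    (Hypothetical helper for illustration; all sizes are assumptions.)
    """
    rng = random.Random(seed)
    # Random inputs; the trailing 1.0 lets the last weight act as a bias.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] + [1.0]
         for _ in range(n_points)]
    # Labels are pure noise, so nothing favours a linear boundary.
    y = [rng.choice((-1, 1)) for _ in range(n_points)]
    w = [0.0] * (n_dims + 1)

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * score(x) <= 0:
                # Classic perceptron update on a misclassified point.
                for j in range(len(w)):
                    w[j] += label * x[j]
                mistakes += 1
        if mistakes == 0:
            # A full pass without errors: the data is linearly separated.
            break

    correct = sum(1 for x, label in zip(X, y) if label * score(x) > 0)
    return correct / n_points


if __name__ == "__main__":
    print("2 dimensions:  ", perceptron_train_accuracy(20, 2))
    print("200 dimensions:", perceptron_train_accuracy(20, 200))
```

With far more dimensions than points, the points are linearly independent with probability one, so a separating hyperplane exists for any labelling and the perceptron will find it. In two dimensions the same random labelling is very unlikely to be separable. Note that this only concerns fit on the training set; it says nothing about generalization, which is what the comments below debate.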

Marc Claesen

I think there is more to it than just this. Even though the RBF kernel can yield the same solution as the linear model, and will thus be at least as good **on the training set**, the RBF kernel is more likely to overfit the data (so it would perform poorly on unseen data - even theoretically). – Bitwise Feb 18 '15 at 13:06
That's incorrect. Sure, you *can* overfit with more complex models. Finding the best nonlinear model is more difficult than finding the best linear one, but theoretically the best RBF kernel model performs at least as well as the best linear one in terms of **generalization performance**. – Marc Claesen Feb 18 '15 at 13:10
I am not sure you are correct. Given a fixed (small) amount of data, a more complex model (in terms of VC dimension, for example) can be shown to overfit and be worse in terms of generalization. This is a consequence of VC theory. – Bitwise Feb 20 '15 at 13:23
@MarcClaesen a bit late, but I think the point is that while the RBF kernel may be capable in principle of finding a solution at least as good as the linear one, the model selection procedures used to determine the optimal hyperparameters might not be able to find that solution, because they can over-fit the model selection criterion (rather than the training criterion). – Dikran Marsupial Sep 20 '16 at 09:19
