20

The kernel trick is used in several machine learning models (e.g., SVMs). It was first introduced in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning".

The Wikipedia definition says that it is

a method for using a linear classifier algorithm to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, where the linear classifier is subsequently used; this makes a linear classification in the new space equivalent to non-linear classification in the original space.
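
For concreteness, here is a small numerical check of that equivalence (a toy Python sketch of my own, using a degree-2 polynomial kernel): evaluating the kernel in the original space gives exactly the dot product in the explicitly mapped higher-dimensional space, so the mapping never has to be computed.

    import numpy as np

    # Explicit feature map phi for 2-D input x = (x1, x2):
    #   phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    # Its dot product <phi(x), phi(z)> equals the kernel k(x, z) = (<x, z>)^2,
    # so the dot product in the higher-dimensional space can be computed
    # without ever forming phi -- this is the kernel trick.

    def phi(x):
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    def poly2_kernel(x, z):
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    print(np.dot(phi(x), phi(z)))  # dot product in the mapped 3-D space
    print(poly2_kernel(x, z))      # same value, computed in the original 2-D space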

One example of a linear model that has been extended to non-linear problems is kernel PCA. Can the kernel trick be applied to any linear model, or does it have certain restrictions?

Danica
Shane
  • BTW, kernels are not really essential for SVMs. The "heart" of the SVM is the principle of soft-margin maximization. Going to the kernel representation makes your problem dimensionality O(m^2) instead of O(d), where m is the number of examples and d is the dimension of your feature space, so if m^2 is more than d you might be better off doing away with kernels: http://jmlr.csail.mit.edu/papers/v6/keerthi05a.html – Yaroslav Bulatov Aug 27 '10 at 17:53
  • @Yaroslav: Thanks for the reference. Are you aware of any implementations of that "Modified Finite Newton Method"? – Shane Sep 01 '10 at 15:23
  • no, but Keerthi and Langford's pages have links to some software that may be related, since they both worked at Yahoo Research – Yaroslav Bulatov Sep 02 '10 at 03:53

3 Answers

17

The kernel trick can only be applied to linear models in which the examples enter the problem formulation exclusively through dot products (Support Vector Machines, PCA, etc.).
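
As a concrete sketch of what that means: ridge regression is such a model. Its solution can be rewritten in "dual" form so that the training examples enter only through the Gram matrix of pairwise dot products, and replacing that matrix by a kernel gives kernel ridge regression. A minimal Python illustration (my own, assuming an RBF kernel and a toy 1-D regression problem):

    import numpy as np

    # Kernel ridge regression as an illustration of the kernel trick.
    # The primal solution w = (X'X + lam*I)^{-1} X'y can be re-expressed as
    # w = X' alpha with alpha = (K + lam*I)^{-1} y, where K[i, j] = <x_i, x_j>.
    # Since the examples appear only through K, any kernel k(x_i, x_j) can be
    # substituted for the plain dot product.

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dist)

    def fit_kernel_ridge(X, y, lam=1e-2, gamma=1.0):
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual coefficients alpha

    def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
        return rbf_kernel(X_new, X_train, gamma) @ alpha

    # Toy non-linear regression problem.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

    alpha = fit_kernel_ridge(X, y)
    print(predict_kernel_ridge(X, alpha, np.array([[0.0], [1.5]])))

A model whose solution cannot be expressed through dot products alone does not admit this rewrite, which is exactly the restriction stated above.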

ebony1
  • When you say "can only be applied to linear models", do you mean that you can prove no application exists for anything other than linear models? – robin girard Aug 28 '10 at 08:14
  • What do you mean by "examples"? As you state it, it seems to be something that can be interpreted as a dot product. – robin girard Aug 28 '10 at 08:16
7

Two further references from B. Schölkopf:

and a website dedicated to kernel machines.

chl
2

@ebony1 gives the key point (+1). I was a co-author of a paper discussing how to kernelize generalised linear models, e.g. logistic regression and Poisson regression; it is pretty straightforward.

G. C. Cawley, G. J. Janacek and N. L. C. Talbot, "Generalised kernel machines", in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2007), pages 1732-1737, Orlando, Florida, USA, August 12-17, 2007.
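
As a rough sketch of the general idea (not the specific algorithm from the paper above): by the representer theorem, the kernelized GLM's linear predictor can be written as f(x) = sum_i alpha_i k(x_i, x), and the alpha_i are then fitted by regularized maximum likelihood for the relevant exponential-family model. A minimal Python illustration for kernel logistic regression, using plain gradient descent and an assumed RBF kernel:

    import numpy as np

    # Rough sketch of kernel logistic regression (a kernelized GLM):
    # p(y = 1 | x) = sigmoid(sum_i alpha_i k(x_i, x)), with alpha fitted by
    # gradient descent on the penalized negative log-likelihood.
    # Illustration of the idea only, not the algorithm from the cited paper.

    def rbf_kernel(A, B, gamma=1.0):
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dist)

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def fit_klr(X, y, lam=1e-2, gamma=1.0, lr=0.1, n_iter=1000):
        K = rbf_kernel(X, X, gamma)
        alpha = np.zeros(len(X))
        for _ in range(n_iter):
            p = sigmoid(K @ alpha)
            grad = K @ (p - y) + lam * K @ alpha  # gradient of the penalized NLL
            alpha -= lr * grad / len(X)
        return alpha

    # Toy binary problem that is not linearly separable (a circle).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float)

    alpha = fit_klr(X, y)
    p_new = sigmoid(rbf_kernel(np.array([[0.0, 0.0], [2.0, 2.0]]), X) @ alpha)
    print(p_new)  # predicted class-1 probabilities for two new points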

I also wrote a (research quality) MATLAB toolbox (sadly no instructions), which you can find here.

Being able to model the target distribution is pretty useful for uncertainty quantification and the like, so this is a worthwhile (if fairly incremental) addition to kernel learning methods.

kjetil b halvorsen
Dikran Marsupial