20

The kernel trick is used in several machine learning models (e.g., SVMs). It was first introduced in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning".

The Wikipedia definition says that it is

a method for using a linear classifier algorithm to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, where the linear classifier is subsequently used; this makes a linear classification in the new space equivalent to non-linear classification in the original space.
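
For concreteness, here is a small numerical check of that equivalence (a toy Python sketch of my own, using a degree-2 polynomial kernel): evaluating the kernel in the original space gives exactly the dot product in the explicitly mapped higher-dimensional space, so the mapping never has to be computed.

    import numpy as np

    # Explicit feature map phi for 2-D input x = (x1, x2):
    #   phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    # Its dot product <phi(x), phi(z)> equals the kernel k(x, z) = (<x, z>)^2,
    # so the dot product in the higher-dimensional space can be computed
    # without ever forming phi -- this is the kernel trick.

    def phi(x):
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    def poly2_kernel(x, z):
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    print(np.dot(phi(x), phi(z)))  # dot product in the mapped 3-D space
    print(poly2_kernel(x, z))      # same value, computed in the original 2-D space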

One example of a linear model that has been extended to non-linear problems is kernel PCA. Can the kernel trick be applied to any linear model, or does it have certain restrictions?

Danica
Shane
  • BTW, kernels are not really essential for SVMs. The "heart" of the SVM is the principle of soft-margin maximization. Going to the kernel representation makes your problem dimensionality O(m^2) instead of O(d), where m is the number of examples and d is the dimension of your feature space, so if m^2 is more than d you might be better off doing away with kernels: http://jmlr.csail.mit.edu/papers/v6/keerthi05a.html – Yaroslav Bulatov Aug 27 '10 at 17:53
  • @Yaroslav: Thanks for the reference. Are you aware of any implementations of that "Modified Finite Newton Method"? – Shane Sep 01 '10 at 15:23
  • no, but Keerthi and Langford's pages have links to some software that may be related, since they both worked at Yahoo Research – Yaroslav Bulatov Sep 02 '10 at 03:53

3 Answers

17

The kernel trick can only be applied to linear models in which the examples enter the problem formulation exclusively through dot products (Support Vector Machines, PCA, etc.).
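
As a concrete sketch of what that means: ridge regression is such a model. Its solution can be rewritten in "dual" form so that the training examples enter only through the Gram matrix of pairwise dot products, and replacing that matrix by a kernel gives kernel ridge regression. A minimal Python illustration (my own, assuming an RBF kernel and a toy 1-D regression problem):

    import numpy as np

    # Kernel ridge regression as an illustration of the kernel trick.
    # The primal solution w = (X'X + lam*I)^{-1} X'y can be re-expressed as
    # w = X' alpha with alpha = (K + lam*I)^{-1} y, where K[i, j] = <x_i, x_j>.
    # Since the examples appear only through K, any kernel k(x_i, x_j) can be
    # substituted for the plain dot product.

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dist)

    def fit_kernel_ridge(X, y, lam=1e-2, gamma=1.0):
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual coefficients alpha

    def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
        return rbf_kernel(X_new, X_train, gamma) @ alpha

    # Toy non-linear regression problem.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

    alpha = fit_kernel_ridge(X, y)
    print(predict_kernel_ridge(X, alpha, np.array([[0.0], [1.5]])))

A model whose solution cannot be expressed through dot products alone does not admit this rewrite, which is exactly the restriction stated above.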

ebony1
  • When you say "can only be applied to linear models", do you mean that you can prove no application exists for anything other than linear models? – robin girard Aug 28 '10 at 08:14
  • What do you mean by "examples"? As you state it, it seems to be something that can be interpreted as a dot product. – robin girard Aug 28 '10 at 08:16
7

Two further references from B. Schölkopf:

and a website dedicated to kernel machines.

chl
2

@ebony1 gives the key point (+1). I was a co-author of a paper discussing how to kernelize generalised linear models, e.g. logistic regression and Poisson regression; it is pretty straightforward.

G. C. Cawley, G. J. Janacek and N. L. C. Talbot, "Generalised kernel machines", in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2007), pages 1732-1737, Orlando, Florida, USA, August 12-17, 2007.
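
As a rough sketch of the general idea (not the specific algorithm from the paper above): by the representer theorem, the kernelized GLM's linear predictor can be written as f(x) = sum_i alpha_i k(x_i, x), and the alpha_i are then fitted by regularized maximum likelihood for the relevant exponential-family model. A minimal Python illustration for kernel logistic regression, using plain gradient descent and an assumed RBF kernel:

    import numpy as np

    # Rough sketch of kernel logistic regression (a kernelized GLM):
    # p(y = 1 | x) = sigmoid(sum_i alpha_i k(x_i, x)), with alpha fitted by
    # gradient descent on the penalized negative log-likelihood.
    # Illustration of the idea only, not the algorithm from the cited paper.

    def rbf_kernel(A, B, gamma=1.0):
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dist)

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def fit_klr(X, y, lam=1e-2, gamma=1.0, lr=0.1, n_iter=1000):
        K = rbf_kernel(X, X, gamma)
        alpha = np.zeros(len(X))
        for _ in range(n_iter):
            p = sigmoid(K @ alpha)
            grad = K @ (p - y) + lam * K @ alpha  # gradient of the penalized NLL
            alpha -= lr * grad / len(X)
        return alpha

    # Toy binary problem that is not linearly separable (a circle).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float)

    alpha = fit_klr(X, y)
    p_new = sigmoid(rbf_kernel(np.array([[0.0, 0.0], [2.0, 2.0]]), X) @ alpha)
    print(p_new)  # predicted class-1 probabilities for two new points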

I also wrote a (research quality) MATLAB toolbox (sadly no instructions), which you can find here.

Being able to model the target distribution is pretty useful for uncertainty quantification and the like, so this is a worthwhile (if fairly incremental) addition to kernel learning methods.

kjetil b halvorsen
Dikran Marsupial