
There are several distinct usages:

  • kernel density estimation
  • kernel trick
  • kernel smoothing

Please explain what the "kernel" in them means, in plain English, in your own words.

– Neil McGuigan (edited by Danica)

  • Not to be rude, but isn't this a question that is already answered ad nauseam on Wikipedia and the like? Google gave me the answer within 15 seconds... – Joris Meys Sep 09 '10 at 08:32
  • I absolutely hate Wikipedia answers for stats. They are rambling, symbolic messes. I am looking for a gem of an answer that can explain this in plain English, as I believe that shows a deeper level of understanding than a math equation. There are many popular "plain English" questions on here, and for good reason. – Neil McGuigan Sep 09 '10 at 18:04

2 Answers

46 votes

There appear to be at least two different meanings of "kernel": one more commonly used in statistics; the other in machine learning.

In statistics "kernel" is most commonly used to refer to kernel density estimation and kernel smoothing.

A straightforward explanation of kernels in density estimation can be found (here).

In machine learning "kernel" is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem "by mapping the original non-linear observations into a higher-dimensional space".

A simple visualisation might be to imagine that all of class $0$ are within radius $r$ of the origin in an x, y plane (class $0$: $x^2 + y^2 < r^2$), and all of class $1$ are beyond radius $r$ in that plane (class $1$: $x^2 + y^2 > r^2$). No linear separator is possible, but clearly a circle of radius $r$ will perfectly separate the data. We can transform the data into three-dimensional space by calculating three new variables $z_1 = x^2$, $z_2 = y^2$ and $z_3 = \sqrt{2}xy$. The two classes are now separable by a plane in this 3-dimensional space: the optimal separating hyperplane is $z_1 + z_2 = r^2$, which in this case omits $z_3$. (If the circle is offset from the origin, the optimal separating hyperplane will vary in $z_3$ as well.) The kernel function corresponds to this mapping: it returns the inner product of two points' images in the 3-dimensional space (here $k(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^2$) without ever computing the mapping explicitly.
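The circle example above can be checked numerically. This is an illustrative sketch (the radii, sample sizes, and variable names are my own choices, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 1.0

# Class 0: 100 points strictly inside the circle of radius r;
# class 1: 100 points strictly outside it.
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 0.8 * r, 100),      # class 0
                        rng.uniform(1.2 * r, 2.0 * r, 100)])  # class 1
x, y = radii * np.cos(angles), radii * np.sin(angles)
labels = np.array([0] * 100 + [1] * 100)

# The feature map: (x, y) -> (z1, z2, z3) = (x^2, y^2, sqrt(2)*x*y).
z1, z2, z3 = x**2, y**2, np.sqrt(2) * x * y

# In the mapped space the plane z1 + z2 = r^2 separates the classes,
# since z1 + z2 is exactly the squared distance from the origin.
predicted = (z1 + z2 > r**2).astype(int)
print((predicted == labels).all())  # prints True

# The kernel trick: inner products in the mapped space can be computed
# directly from the original coordinates, k(u, v) = (u . v)^2.
phi = lambda p: np.array([p[0]**2, p[1]**2, np.sqrt(2) * p[0] * p[1]])
u, v = np.array([x[0], y[0]]), np.array([x[1], y[1]])
print(np.isclose(phi(u) @ phi(v), (u @ v) ** 2))  # prints True
```

The last two lines show why the trick is useful: a classifier that only needs inner products never has to construct the 3-dimensional coordinates at all.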

In mathematics, there are other uses of "kernels", but these seem to be the main ones in statistics.

– Thylacoleo (edited by mkt)

  • Very nice! I'm going to use your example with the circle to explain kernel methods, as it is the best visualisation I've met so far. Thanks! – Joris Meys Sep 10 '10 at 13:47
  • The following video was proposed by an anonymous potential editor as "a great visualisation of what Thylacoleo explained:" http://www.youtube.com/watch?v=3liCbRZPrZA – gung - Reinstate Monica Jun 06 '13 at 22:56
  • Following up Thylacoleo's example using the circle to explain the kernel trick (I don't have enough reputation to add a comment directly to his answer): was there a simple typo in the equation for the separating hyperplane? Should it be z1 + z2 = r^2 instead of z1 + z2 = 1, or do I misunderstand? I agree it's a nice simple example to illustrate the concept. Thanks. The definition of z3 still seems a bit of a mystery, but apparently it doesn't matter for the example centred at the origin. – Alex Blakemore Sep 11 '10 at 01:37
  • Yes, there was a typo. Thanks for that, Alex. I don't always proofread :-) – Thylacoleo Sep 13 '10 at 07:26
  • Do we use inner products to map 2-dimensional data to 3-dimensional? – SmallChess Nov 09 '15 at 03:45
41 votes

In both the statistics (kernel density estimation or kernel smoothing) and machine learning (kernel methods) literatures, a kernel is used as a measure of similarity. In particular, the kernel function $k(x, \cdot)$ defines the distribution of similarities of points around a given point $x$; $k(x, y)$ denotes the similarity of point $x$ with another given point $y$.
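The similarity reading can be made concrete with a Gaussian kernel, the most common choice in kernel density estimation (a minimal sketch; the function name and bandwidth value are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y): similarity of x and y, largest when they coincide."""
    return np.exp(-0.5 * ((x - y) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# Similarity falls off smoothly as y moves away from x:
x = 0.0
print(gaussian_kernel(x, 0.0))   # highest similarity
print(gaussian_kernel(x, 0.5))
print(gaussian_kernel(x, 2.0))   # lowest of the three
```

The bandwidth controls how quickly similarity decays with distance, which is why it is the key tuning parameter in kernel density estimation and kernel smoothing.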

– ebony1 (edited by Tim)
  • This is a nice way of putting it. I am wondering if you can generalize this description to also apply to the kernel of 'kernel density estimation'. – shabbychef Sep 09 '10 at 16:18
  • In a way, yes. One way to understand kernel density estimation is that you approximate the density at a point as a weighted average of its similarities with a set of points drawn from the distribution. So the notion of similarity does play a role here as well. – ebony1 Sep 09 '10 at 17:14
  • I understand "kernel" in statistics to be borrowed originally from jargon used in the discussion of integral equations. – Nick Cox Sep 16 '14 at 14:47
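The weighted-average reading of kernel density estimation in ebony1's comment can be sketched as follows (an illustration with made-up sample values and an assumed Gaussian kernel and bandwidth):

```python
import numpy as np

def gaussian_kernel(u, bandwidth):
    return np.exp(-0.5 * (u / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))

def kde(x, sample, bandwidth=0.5):
    """Estimated density at x: the average similarity of x to the sample points."""
    return gaussian_kernel(x - sample, bandwidth).mean()

sample = np.array([-1.0, 0.0, 0.2, 1.5])  # made-up observations
print(kde(0.1, sample))  # larger: 0.1 sits among the observations
print(kde(5.0, sample))  # near zero: 5.0 is far from every observation
```

Each observation "votes" for density near itself, and the kernel decides how far that vote reaches.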