
How to prove that the radial basis function $k(x, y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)$ is a kernel? As far as I understand, in order to prove this we have to prove either of the following:

  1. For any set of vectors $x_1, x_2, \dots, x_n$, the matrix $K(x_1, x_2, \dots, x_n) = \big(k(x_i, x_j)\big)_{n \times n}$ is positive semidefinite.

  2. A mapping $\Phi$ can be exhibited such that $k(x, y) = \langle\Phi(x), \Phi(y)\rangle$.

Any help?
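
For what it's worth, condition 1 is easy to check numerically on random points (that only suggests the result, it is not a proof). A small numpy sketch, with points and $\sigma$ chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 arbitrary points in R^3
sigma = 1.5

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# all eigenvalues should be non-negative, up to floating-point error
print(np.linalg.eigvalsh(K).min())
```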

Leo
    Just to link it more obviously: the feature map is also discussed [in this question](https://stats.stackexchange.com/questions/69759/feature-map-for-the-gaussian-kernel), particularly [Marc Claesen's answer](http://stats.stackexchange.com/a/69767/9964) based on Taylor series and [mine](http://stats.stackexchange.com/a/145439/9964) which discusses both the RKHS and the general version of the $L_2$ embedding given by Douglas below. – Danica Jun 05 '15 at 09:04

3 Answers


I'll add a third method, just for variety: building up the kernel from a sequence of general steps known to create pd kernels. Let $\mathcal X$ denote the domain of the kernels below and $\varphi$ the feature maps.

  • Scalings: If $\kappa$ is a pd kernel, so is $\gamma \kappa$ for any constant $\gamma > 0$.

    Proof: if $\varphi$ is the feature map for $\kappa$, $\sqrt\gamma \varphi$ is a valid feature map for $\gamma \kappa$.

  • Sums: If $\kappa_1$ and $\kappa_2$ are pd kernels, so is $\kappa_1 + \kappa_2$.

    Proof: Concatenate the feature maps $\varphi_1$ and $\varphi_2$ to get $x \mapsto \begin{bmatrix}\varphi_1(x) \\ \varphi_2(x)\end{bmatrix}$.

  • Limits: If $\kappa_1, \kappa_2, \dots$ are pd kernels, and $\kappa(x, y) := \lim_{n \to \infty} \kappa_n(x, y)$ exists for all $x, y$, then $\kappa$ is pd.

    Proof: For each $m, n \ge 1$ and every $\{ (x_i, c_i) \}_{i=1}^m \subseteq \mathcal{X} \times \mathbb R$ we have that $\sum_{i,j=1}^m c_i \, c_j \, \kappa_n(x_i, x_j) \ge 0$. Taking the limit as $n \to \infty$ gives the same property for $\kappa$.

  • Products: If $\kappa_1$ and $\kappa_2$ are pd kernels, so is $g(x, y) = \kappa_1(x, y) \, \kappa_2(x, y)$.

    Proof: It follows immediately from the Schur product theorem, but Schölkopf and Smola (2002) give the following nice, elementary proof. Let $$ (V_1, \dots, V_m) \sim \mathcal{N}\left( 0, \left[ \kappa_1(x_i, x_j) \right]_{ij} \right) \\ (W_1, \dots, W_m) \sim \mathcal{N}\left( 0, \left[ \kappa_2(x_i, x_j) \right]_{ij} \right) $$ be independent. Then, since the two vectors are independent and have mean zero, $$\mathrm{Cov}(V_i W_i, V_j W_j) = \mathrm{Cov}(V_i, V_j) \,\mathrm{Cov}(W_i, W_j) = \kappa_1(x_i, x_j) \kappa_2(x_i, x_j).$$ Covariance matrices must be psd, so considering the covariance matrix of $(V_1 W_1, \dots, V_m W_m)$ proves it. (A small Monte Carlo illustration of this argument is sketched after the list.)

  • Powers: If $\kappa$ is a pd kernel, so is $\kappa^n(x, y) := \kappa(x, y)^n$ for any positive integer $n$.

    Proof: immediate from the "products" property.

  • Exponents: If $\kappa$ is a pd kernel, so is $e^\kappa(x, y) := \exp(\kappa(x, y))$.

    Proof: We have $e^\kappa(x, y) = \lim_{N \to \infty} \sum_{n=0}^N \frac{1}{n!} \kappa(x, y)^n$; use the "powers", "scalings", "sums", and "limits" properties.

  • Functions: If $\kappa$ is a pd kernel and $f : \mathcal X \to \mathbb R$, $g(x, y) := f(x) \kappa(x, y) f(y)$ is as well.

    Proof: Use the feature map $x \mapsto f(x) \varphi(x)$.
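
Before putting these pieces together, here is a small Monte Carlo illustration of the "products" argument above (only a sketch, with two pd kernels and sample points of my own choosing): the empirical covariance of the elementwise products $V_i W_i$ reproduces $\kappa_1 \kappa_2$ up to sampling error, and the elementwise product is psd because it is a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))               # 5 arbitrary points in R^2

# two pd kernels on those points: linear and unit-bandwidth RBF
K1 = X @ X.T + 1e-9 * np.eye(5)           # tiny jitter so sampling is well-posed
K2 = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2)

# independent zero-mean Gaussian vectors with these covariances
n = 200_000
V = rng.multivariate_normal(np.zeros(5), K1, size=n)
W = rng.multivariate_normal(np.zeros(5), K2, size=n)

emp = np.cov((V * W).T)                   # covariance of the elementwise products
print(np.max(np.abs(emp - K1 * K2)))      # small (Monte Carlo error)
print(np.linalg.eigvalsh(K1 * K2).min())  # >= 0, as the argument predicts
```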

Now, note that \begin{align*} k(x, y) &= \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right) \\&= \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x \rVert^2 \right) \exp\left( \tfrac{1}{\sigma^2} x^T y \right) \exp\left( - \tfrac{1}{2 \sigma^2} \lVert y \rVert^2 \right) .\end{align*} Start with the linear kernel $\kappa(x, y) = x^T y$, apply "scalings" with $\frac{1}{\sigma^2}$, apply "exponents", and apply "functions" with $x \mapsto \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x \rVert^2 \right)$.
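
As a sanity check on that last decomposition (a numerical sketch only, with toy data of my own choosing), building the Gram matrix through the three factors reproduces the RBF Gram matrix exactly, and the result is indeed psd:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))               # 8 arbitrary points in R^4
sigma = 1.0

# "scalings" + "exponents": exp(x^T y / sigma^2)
E = np.exp(X @ X.T / sigma ** 2)

# "functions": multiply by f(x) f(y) with f(x) = exp(-||x||^2 / (2 sigma^2))
f = np.exp(-np.sum(X ** 2, axis=1) / (2 * sigma ** 2))
K_composed = f[:, None] * E * f[None, :]

# direct RBF Gram matrix for comparison
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq / (2 * sigma ** 2))

print(np.max(np.abs(K_composed - K_rbf)))   # agrees up to floating-point error
print(np.linalg.eigvalsh(K_rbf).min())      # >= 0 up to rounding
```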

Danica

Zen used method 1. Here is method 2: Map $x$ to a spherically symmetric Gaussian distribution centered at $x$ in the Hilbert space $L^2$. The standard deviation and a constant factor have to be tweaked for this to work exactly. For example, in one dimension,

$$ \int_{-\infty}^\infty \frac{\exp[-(x-z)^2/(2\sigma^2)]}{\sqrt{2 \pi} \sigma} \frac{\exp[-(y-z)^2/(2 \sigma^2)]}{\sqrt{2 \pi} \sigma} dz = \frac{\exp [-(x-y)^2/(4 \sigma^2)]}{2 \sqrt \pi \sigma}. $$

So, use a standard deviation of $\sigma/\sqrt 2$ and scale the Gaussian distribution to get $k(x,y) = \langle \Phi(x), \Phi(y)\rangle$. This last rescaling occurs because the $L^2$ norm of a normal distribution is not $1$ in general.
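
(A quick numerical check of the displayed identity, with $x$, $y$, $\sigma$ chosen arbitrarily; this is just a sketch, not part of the argument:)

```python
import numpy as np

sigma, x, y = 1.3, 0.4, -0.9
z = np.linspace(-40, 40, 400_001)         # fine grid covering both densities
dz = z[1] - z[0]

phi_x = np.exp(-(x - z) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
phi_y = np.exp(-(y - z) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

lhs = np.sum(phi_x * phi_y) * dz          # <Phi(x), Phi(y)> in L^2 (Riemann sum)
rhs = np.exp(-(x - y) ** 2 / (4 * sigma ** 2)) / (2 * np.sqrt(np.pi) * sigma)
print(lhs, rhs)                           # the two agree to many decimal places
```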

Douglas Zare
    @Zen, Douglas Zare: thank you for your great answers. How am I supposed to select the official answer now? – Leo Sep 04 '12 at 22:03

I will use method 1. Check Douglas Zare's answer for a proof using method 2.

I will prove the case when $x,y$ are real numbers, so $k(x,y)=\exp\!\big(-(x-y)^2/(2\sigma^2)\big)$. The general case follows mutatis mutandis from the same argument (replace the scalar $Z$ below by a standard multivariate normal vector and the products $x_j Z$ by inner products $x_j^\top Z$), and is worth doing.

Without loss of generality, suppose that $\sigma^2=1$.

Write $k(x,y)=h(x-y)$, where $$h(t)=\exp\left(-\frac{t^2}{2}\right)=\mathrm{E}\left[e^{itZ}\right] $$ is the characteristic function of a random variable $Z$ with $N(0,1)$ distribution.

For real numbers $x_1,\dots,x_n$ and $a_1,\dots,a_n$, we have $$ \sum_{j,k=1}^n a_j\,a_k\,h(x_j-x_k) = \sum_{j,k=1}^n a_j\,a_k\,\mathrm{E} \left[ e^{i(x_j-x_k)Z}\right] = \mathrm{E} \left[ \sum_{j,k=1}^n a_j\,e^{i x_j Z}\,a_k\,e^{-i x_k Z}\right] = \mathrm{E}\left[ \left| \sum_{j=1}^n a_j\,e^{i x_j Z}\right|^2\right] \geq 0 \, , $$ which entails that $k$ is a positive semidefinite function, aka a kernel.
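
(If it helps to see this identity numerically, here is a small Monte Carlo sketch with points and weights chosen arbitrarily:)

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 2.0, 0.7])       # arbitrary points
a = np.array([1.0, -0.5, 2.0, -1.5])      # arbitrary weights

# quadratic form sum_{j,k} a_j a_k h(x_j - x_k) with h(t) = exp(-t^2 / 2)
H = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)
quad_form = a @ H @ a

# Monte Carlo estimate of E | sum_j a_j exp(i x_j Z) |^2 with Z ~ N(0, 1)
Z = rng.standard_normal(500_000)
S = np.exp(1j * Z[:, None] * x[None, :]) @ a
mc = np.mean(np.abs(S) ** 2)

print(quad_form, mc)                      # close, and both non-negative
```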

To understand this result in greater generality, check out Bochner's Theorem: http://en.wikipedia.org/wiki/Positive-definite_function

Zen
    This is a good start, in the right direction, with two caveats: (a) $h(t)$ is not equal to the expectation shown (check the sign in the exponent) and (b) this appears to restrict attention to the case where $x$ and $y$ are scalars and not vectors. I've upvoted in the meantime, because the exposition is nice and clean and I'm sure you'll quickly plug these small gaps. :-) – cardinal Sep 03 '12 at 23:21
    Tks! I'm in a hurry here. :-) – Zen Sep 03 '12 at 23:42
    Excuse me, I really don't see how you manage the mutatis mutandis here. If you develop the norm before passing to the $h$ form, then you got products and you can't swap products and sum. And I simply don't see how to develop the norm after passing to the h form to obtain a nice expression. Can you lead me a bit there ? :) – Alburkerk Dec 05 '17 at 21:23
  • How did $e^{-ix_kZ}$ get rid of the minus from first to second line of the expression? – Slim Shady Oct 25 '21 at 16:31