
How can I prove that a linear combination of two kernel functions is also a kernel function?

\begin{align} k_p(x, y) = a_1 k_1(x, y) + a_2 k_2(x, y) \end{align}

given that $k_1(\cdot,\cdot)$ and $k_2(\cdot,\cdot)$ are valid kernel functions.

In general, to prove any such result involving dot products, composition, etc., what methodology can be followed to show that the RHS is a kernel function, given that all the $k$'s on the LHS are kernels?

tusharfloyd
  • Only linear combinations with positive coefficients are always valid. – Marc Claesen Oct 15 '15 at 17:34
  • @MarcClaesen Just to clarify a minor point: only positive linear combinations are *necessarily* valid. It's possible for one with some negative coefficients to still be psd; this arises in multiple kernel learning problems, where the optimization becomes a semidefinite program. – Danica Oct 15 '15 at 17:36
  • @Dougal point taken! – Marc Claesen Oct 15 '15 at 17:37
  • I recommend these lecture notes http://alex.smola.org/teaching/cmu2013-10-701x/slides/recitation_3_kernel_properties_convexity.pdf and the book http://www.kernel-methods.net/ for more detailed explanations. – JStrahl Sep 20 '16 at 12:59

2 Answers


A necessary and sufficient condition for a function $\kappa(\cdot,\cdot)$ to be expressible as an inner product in some feature space $\mathcal{F}$ is a weak form of Mercer's condition, namely that:

$$ \int_\mathbf{x} \int_\mathbf{y} \kappa(\mathbf{x},\mathbf{y})g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \geq 0, $$ for all square-integrable functions $g(\cdot)$ [1,2].

In your case, this reduces to the following: $$ \begin{align} &\int_\mathbf{x} \int_\mathbf{y} \big(a_1\kappa_1(\mathbf{x},\mathbf{y}) + a_2 \kappa_2(\mathbf{x},\mathbf{y})\big)g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \\ &= a_1 \underbrace{\int_\mathbf{x} \int_\mathbf{y} \kappa_1(\mathbf{x},\mathbf{y})g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y}}_{\geq 0} + a_2 \underbrace{\int_\mathbf{x} \int_\mathbf{y}\kappa_2(\mathbf{x},\mathbf{y})g(\mathbf{x})g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y}}_{\geq 0} \geq 0. \end{align} $$ Since $\kappa_1(\cdot,\cdot)$ and $\kappa_2(\cdot,\cdot)$ are given to be kernel functions, both underbraced integrals are nonnegative by Mercer's condition. Hence, if $a_1 \geq 0$ and $a_2 \geq 0$, the overall integral is nonnegative as well, so $k_p$ satisfies the condition too. $\blacksquare$

Note that, as @Dougal correctly pointed out, it is still possible to obtain a valid kernel function with a negative $a_1$ or $a_2$ (though not both), but whether the combination remains positive semidefinite then depends on the particular kernels and coefficients.
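For a quick numerical illustration of both points (a sanity check, not a proof), here is a minimal sketch assuming NumPy; the kernels, sample points, and coefficients are arbitrary choices. With nonnegative coefficients the combined Gram matrix stays positive semidefinite, while a negative coefficient can break that:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 arbitrary sample points in R^3

K1 = X @ X.T                          # linear kernel Gram matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq_dists)          # Gaussian (RBF) kernel Gram matrix

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

print(min_eig(2.0 * K1 + 3.0 * K2))   # nonnegative (up to rounding): still a valid kernel
print(min_eig(-2.0 * K1 + 3.0 * K2))  # can be negative: PSD property lost
```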

[1] Vladimir N. Vapnik. Statistical learning theory. Wiley, 1 edition, September 1998.

[2] Richard Courant and David Hilbert. Methods of Mathematical Physics, volume 1. Interscience Publishers, Inc., New York, NY, 1953.

Marc Claesen

As an alternative approach to Marc's:

A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ is a kernel function iff there is some "feature map" $\varphi : \mathcal X \to \mathcal H$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$, where $\mathcal H$ is a Hilbert space.

Let $\varphi_i$ be the feature map for $k_i$, and $\mathcal H_i$ its Hilbert space.

Now, $\mathcal H_p := \mathcal H_1 \oplus \mathcal H_2$ is a Hilbert space, and $\varphi_p := \sqrt{a_1} \varphi_1 \oplus \sqrt{a_2} \varphi_2$ is a feature map from $\mathcal X$ to it as long as $a_1, a_2 \ge 0$. (For finite-dimensional feature spaces, this is just concatenating the feature maps together.) Note that \begin{align} \langle \varphi_p(x), \varphi_p(y) \rangle_{\mathcal H_p} &= \langle \sqrt{a_1} \varphi_1(x) \oplus \sqrt{a_2} \varphi_2(x), \sqrt{a_1} \varphi_1(y) \oplus \sqrt{a_2} \varphi_2(y) \rangle_{\mathcal H_1 \oplus \mathcal H_2} \\&= a_1 \langle \varphi_1(x), \varphi_1(y) \rangle_{\mathcal H_1} + a_2 \langle \varphi_2(x), \varphi_2(y) \rangle_{\mathcal H_2} \\&= k_p(x, y) ,\end{align} so $k_p$ has feature map $\varphi_p$, and is therefore a valid kernel.
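For intuition, here is a minimal finite-dimensional sketch of this construction, assuming NumPy; the particular feature maps are illustrative choices:

```python
import numpy as np

a1, a2 = 2.0, 3.0                     # any nonnegative coefficients

def phi1(x):
    """Feature map of the linear kernel k1(x, y) = <x, y>."""
    return x

def phi2(x):
    """Feature map of the quadratic kernel k2(x, y) = <x, y>**2."""
    return np.outer(x, x).ravel()

def phi_p(x):
    """Concatenation sqrt(a1)*phi1 (+) sqrt(a2)*phi2."""
    return np.concatenate([np.sqrt(a1) * phi1(x), np.sqrt(a2) * phi2(x)])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

lhs = phi_p(x) @ phi_p(y)              # <phi_p(x), phi_p(y)>
rhs = a1 * (x @ y) + a2 * (x @ y) ** 2 # a1*k1(x, y) + a2*k2(x, y)
print(np.isclose(lhs, rhs))            # True
```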


To your "in general" question at the end: if you want to prove them for arbitrary kernels, the two main techniques are the one Marc used and the one I used. Often, though, for Marc's approach we directly use the Mercer condition rather than the integral form, which can be easier to reason about:

A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ is positive semidefinite if and only if for all $M$, all $x_1, \dots, x_M \in \mathcal X$, and all $c_1, \dots, c_M \in \mathbb R$, $\sum_{i=1}^M \sum_{j=1}^M c_i k(x_i, x_j) c_j \ge 0$.

We can also use the following equivalent form:

A symmetric function $k : \mathcal X \times \mathcal X \to \mathbb R$ is positive semidefinite if and only if for all $M$ and all $x_1, \dots, x_M \in \mathcal X$, the matrix $K$ with $K_{ij} = k(x_i, x_j)$ is positive semidefinite.
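This last form also gives a cheap numerical way to *refute* a candidate kernel: evaluate the Gram matrix on a sample of points and test the quadratic form. A passing check is not a proof, of course, since it only probes finitely many points. A minimal sketch, assuming NumPy and using a Gaussian kernel as the example:

```python
import numpy as np

def k(x, y):
    """Gaussian kernel, as an example of a function to test."""
    return np.exp(-0.5 * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))                         # arbitrary sample points
K = np.array([[k(xi, xj) for xj in X] for xi in X])  # K_ij = k(x_i, x_j)

# Check c^T K c >= 0 for many random coefficient vectors c.
for _ in range(1000):
    c = rng.normal(size=len(X))
    assert c @ K @ c >= -1e-10                       # nonnegative up to rounding
print("quadratic form nonnegative for all sampled c")
```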

I previously gave brief proofs for several such properties in this answer.

Danica
  • Is having a feature mapping of the form $a(x)^T a(y)$ a sufficient condition for it to be a valid kernel? – tusharfloyd Oct 15 '15 at 20:08
  • @tusharfloyd Yes; the properties of inner products guarantee Mercer's condition. – Danica Oct 15 '15 at 20:11
  • Is it true because an inner product $k(x_i, x_j)$ will always produce a positive semidefinite $K$ (Gram matrix)? – tusharfloyd Oct 15 '15 at 20:15
  • @tusharfloyd Yep, exactly. – Danica Oct 15 '15 at 20:15
  • Thanks. Just one last question: isn't symmetry also a required condition for a valid kernel, or is it trivial to prove and therefore ignored? Also, can you direct me to a link or reference with a proof or validation of my second comment, i.e. why an inner product gives a positive semidefinite matrix? I am still not able to wrap my head around all these axioms. – tusharfloyd Oct 15 '15 at 20:28
  • Whoops, you're right about the symmetry; that's implied by the complex version of this condition, but not the real one (see e.g. [this discussion](https://en.m.wikipedia.org/wiki/Positive-definite_matrix#Consistency_between_real_and_complex_definitions)). Will fix. – Danica Oct 15 '15 at 20:35
  • For inner product being psd: $\sum_{i,j} c_i \langle x_i, x_j \rangle c_j = \langle \sum_i c_i x_i, \sum_i c_i x_i \rangle \ge 0$ by the properties of inner products. – Danica Oct 15 '15 at 20:38
  • My understanding of Mercer's conditions for a function to be a kernel is that (1) the kernel function is symmetric and (2) the kernel matrix is p.s.d. If $k_1$ and $k_2$ are p.s.d., isn't it sufficient to observe that the sum of two p.s.d. matrices is itself p.s.d.? – Sycorax Oct 22 '15 at 19:09
  • @user777 Yep, that's effectively using the last condition I cite here, and amounts to the same thing. – Danica Oct 22 '15 at 19:11