
I am just learning about Mercer kernels, and a question came up. By Mercer's theorem, we know that a positive definite kernel matrix can be represented by inner products of the input vectors mapped into a new feature space implied by the kernel.

A Gram matrix of $X$ is defined as $K(X;k)\in \mathbb{R}^{m\times m}$ such that $K_{i,j}=k(\hat{x}_i,\hat{x}_j)$. If the matrix $K$ is positive definite, then $k$ is called a Mercer kernel. By Mercer's theorem, if we have a Mercer kernel, then there exists a function $\phi: X \to Y$ such that $$k(\hat{x}_i,\hat{x}_j)=\langle \phi(\hat{x}_i),\phi(\hat{x}_j) \rangle.$$ The question is: since this is the case, why do we need to use the kernel function at all? Why not just transform the data according to $\phi$ and use the transformed features to train the SVM? Apparently there should be some difficulty with this approach when classifying a new data point, but I am not quite seeing the issue.

Thanks!

  • I'm not totally sure what you're asking, but I will point out that the value of the kernel trick is that you *don't* need to use the $\phi$ function at all; instead you can work directly with the kernel function. For example, a kernel for a polynomial basis of degree $d$ can be written as $(x^Ty+1)^d$, far simpler to compute than $\phi$, which for a three-element $x$ and $d=2$ results in a ten-element $\phi(x)$, which then has to be multiplied by a ten-element $\phi(y)$... and that's a very small example. – jbowman Nov 08 '19 at 20:52

1 Answer


Say our data lives in $\mathbb R$ and we're using the kernel $$k(x, y) = 1 + 2 x y + x^2 y^2,$$ which corresponds to $$\phi(x) = \begin{bmatrix}1 \\ \sqrt 2 x \\ x^2\end{bmatrix}.$$

If we train a linear SVM on the three-dimensional $\phi(x)$ data, or a kernel SVM on the one-dimensional $x$ data, we'll get the same prediction rule for new data out. So, in this case, the kernel trick is "unnecessary."
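As a concrete check, here is a minimal sketch, assuming scikit-learn and NumPy and with made-up data and labels, showing that the two routes give the same decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(40, 1))        # one-dimensional inputs
y = (np.abs(x[:, 0]) > 1).astype(int)       # a label rule that needs the quadratic feature

def phi(x):
    """Explicit feature map for k(x, y) = 1 + 2xy + x^2 y^2."""
    return np.hstack([np.ones_like(x), np.sqrt(2) * x, x ** 2])

def k(a, b):
    """The same kernel evaluated directly: returns the Gram matrix of two point sets."""
    return (1 + a @ b.T) ** 2               # = 1 + 2ab + a^2 b^2 for scalar inputs

lin = SVC(kernel="linear", C=1.0).fit(phi(x), y)   # linear SVM on the explicit 3-D features
ker = SVC(kernel=k, C=1.0).fit(x, y)               # kernel SVM on the raw 1-D data

x_new = rng.uniform(-2, 2, size=(10, 1))
# The two decision functions agree up to solver tolerance.
print(np.allclose(lin.decision_function(phi(x_new)),
                  ker.decision_function(x_new)))
```

The Gram matrices the two solvers see are identical entry by entry, which is exactly why the learned rules coincide.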

But say our data lives in $\mathbb R^d$ and we want to use the kernel $$k(x, y) = 1 + 2 x^T y + (x^T y)^2 = (x^T y + 1)^2.$$ Then the corresponding features end up being of dimension $\frac12 d^2 + \frac32 d + 1$. This is suddenly a much larger model that will take more computation to solve, more memory to store, etc.
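To get a feel for that growth, here is a small sketch, assuming scikit-learn; `PolynomialFeatures` spans the same degree-2 monomials as $\phi$ (without the $\sqrt 2$ scalings), so its output dimension matches the count above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for d in (3, 10, 100, 1000):
    X = np.zeros((1, d))                                   # a single dummy point in R^d
    n_explicit = PolynomialFeatures(degree=2).fit_transform(X).shape[1]
    print(d, n_explicit, d ** 2 / 2 + 3 * d / 2 + 1)       # the two counts agree
# At d = 1000 we already need 501,501 explicit features per point,
# while the kernel value is still one dot product plus a square.
```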

Even worse, say we want to use the kernel $$ k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right) .$$ The $\phi$ features here end up being infinite-dimensional (see here), which means it will take an infinite amount of time and memory to compute with directly. But using the kernel trick, we can do it just fine.
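As an illustration (a sketch assuming scikit-learn, with made-up data), the Gaussian-kernel SVM trains without ever materializing $\phi$; note that scikit-learn parametrizes this kernel by `gamma` $= 1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (np.linalg.norm(X, axis=1) > np.sqrt(5)).astype(int)   # a nonlinear label rule

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1 / (2 * sigma ** 2)).fit(X, y)
print(clf.score(X, y))   # fit and prediction only ever touch kernel evaluations
```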

It can also be much easier to choose functions $k$ with certain properties than it is to design feature functions $\phi$. (Kernels are no harder to design than feature functions – you can just write them as $\phi(x)^T \phi(y)$ directly, after all – but the opposite direction can be quite difficult.)

Danica
  • This is basically the same as [@jbowman's comment](https://stats.stackexchange.com/questions/435230/is-the-kernel-trick-unnecessary-for-for-non-linear-svm/435237#comment811400_435230), but I was already writing it when the comment appeared, so... :) – Danica Nov 08 '19 at 20:58
  • But your answer (+1) is much better and more comprehensive than my comment! – jbowman Nov 08 '19 at 21:00
  • I see what you're saying. The motivation is all to do with efficiency, and obviously for the infinite-dimensional case I understand. But apparently there may also be a more important problem with trying to classify a new data point using the direct feature map instead of the kernel, although I don't quite see it. The decision function is an inner product of the new data point with the support vectors. It still seems to me that all of this would work when directly computing the feature map. Sorry, I'm not sure if I am being very clear. – jeffery_the_wind Nov 09 '19 at 00:14
  • If you know an explicit $\phi$, there is no issue with new data points; you just use $\phi$ of them. What you might be referring to is what happens if you _don't_ know $\phi$ but are trying to work with a known $n \times n$ kernel matrix $\mathbf K$ in a linear way. In that case, you can construct a $\phi$ valid for _only those $n$ points_ by taking a Cholesky decomposition, $\mathbf K = \boldsymbol\Phi \boldsymbol\Phi^T$; then you can use the $i$th row of $\boldsymbol\Phi$ as $\phi(x_i)$, but you won't be able to extend that to new data points. – Danica Nov 09 '19 at 00:26
  • Because these features are in $\mathbb R^n$, though, you generally don't gain anything computationally with this approach versus just working with the $n\times n$ kernel matrix in the first place (and you lose a lot); it's only in fairly limited circumstances where this Cholesky trick is useful. – Danica Nov 09 '19 at 00:28
  • Yes, the point of the question is that we start with a Mercer kernel and find the feature maps using the Cholesky factorization (although I saw it as an eigenvalue decomposition), and then use these vectors to train the model. You stated that you won't be able to extend this to new data points; I guess that is the missing piece. Is it because in this case we only recover the values $\phi(x_i)$ but don't actually recover the function $\phi$? So we still wouldn't know how to map a new $x$ to $\phi(x)$? – jeffery_the_wind Nov 09 '19 at 00:51
  • @jeffery_the_wind When we do $\mathbf K = \boldsymbol\Phi \boldsymbol\Phi^T$, we've found points in $\mathbb R^n$ whose inner products agree with $k$ on the $n$ input points; this is the feature embedding for a kernel $$k_X(x, y) = \begin{cases} k(x_i, x_j) & \text{if } x = x_i,\ y = x_j \\ 0 & \text{otherwise}\end{cases}.$$ But we _haven't_ found the value of a valid feature embedding for $k$ even for those $n$ points: that should take values in an infinite-dimensional $\mathcal H$, not $\mathbb R^n$. In particular, the value of $\phi(x)$ shouldn't depend on the points you're comparing it to. – Danica Nov 09 '19 at 01:12
  • Also interesting about the Cholesky factorization: the $\Phi$ matrix is lower triangular, so the intrinsic dimensionality of the derived feature vectors will actually be less than $n$. – jeffery_the_wind Nov 09 '19 at 03:20
  • 1
    @jeffery_the_wind They're sparse, but (for a strictly positive definite kernel like the Gaussian) they still need to be $n$ dimensional; one point needs to use all $n$ dimensions. – Danica Nov 09 '19 at 03:53
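
For readers following the comment thread above, here is a minimal NumPy sketch of the Cholesky construction Danica describes (the data and kernel bandwidth are made up for illustration): the rows of $\boldsymbol\Phi$ reproduce the Gram matrix on the $n$ training points, but nothing tells you how to embed a new point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                           # n = 6 training points in R^2
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)                           # Gaussian kernel with sigma = 1

Phi = np.linalg.cholesky(K)                           # lower triangular, K = Phi @ Phi.T
print(np.allclose(Phi @ Phi.T, K))                    # True: row i of Phi plays the role of phi(x_i)

# There is no formula that maps a *new* point into these 6 coordinates,
# which is exactly why this construction does not extend beyond the n training points.
```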