Understanding the reasons for using the kernel method in SVM

Question

I understand that one can use kernel functions (e.g. the radial kernel) to create a non-linear decision boundary.

However, there is something wrong with my logic, and I am sure there is something that I have clearly misunderstood:

I understand that Kernel functions operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates.

Here is where my logic went wrong:

So, I believe $K: \mathbb{R}^N \rightarrow \mathbb{R}$, i.e. it maps input from a high-dimensional space to a 1-dimensional space. (Not sure if this is correct.)

However, in Andrew Ng's lecture videos on SVM, he mentions that a kernel can also take original data in $\mathbb{R}^1$ and map it to a very high-dimensional feature space $\mathbb{R}^N$.

This becomes a contradiction and is very confusing.

Please correct my misunderstanding. Thanks.

Can you post a link to the lecture you mentioned, as a reference? — Jundiaius, Jun 02 '14 at 08:46

score 3 · Answer 1 · answered Jun 02 '14 at 10:03

I think you're confusing two different mappings.

We have data, consisting of a set of samples $x_1, x_2, \ldots$. Our data inhabits the space $\mathbb{R}^{N_{data}}$. So for each sample,

$x_i \in \mathbb{R}^{N_{data}}$.

Imagine mapping this data to a new space. We'll use the function $\Phi$ for this. $\Phi$ maps to a much higher dimensional space $\mathbb{R}^{N_{kernel}}$ where $N_{kernel} \gg N_{data}$.

$\Phi: \mathbb{R}^{N_{data}} \rightarrow \mathbb{R}^{N_{kernel}}$.

What we could do now is just explicitly calculate the co-ordinates of our mapped samples, i.e. find $\Phi(x_1), \Phi(x_2), \ldots$ However, as you point out, if we can cast our problem purely in terms of dot products between samples then we can skip this. That's where the kernel function $K$ comes in. We just use it to replace all our dot products.

$K(x_i,x_j) \equiv \Phi(x_i) \cdot \Phi(x_j)$.
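
As a small worked example (not from the lecture; the specific kernel and feature map here are chosen purely for illustration), take $N_{data} = 2$ and the homogeneous quadratic kernel $K(x, y) = (x \cdot y)^2$. Expanding the square shows it is a dot product in a space with $N_{kernel} = 3$:

$$
\begin{aligned}
K(x, y) &= (x_1 y_1 + x_2 y_2)^2 \\
        &= x_1^2 y_1^2 + 2\, x_1 x_2\, y_1 y_2 + x_2^2 y_2^2 \\
        &= \Phi(x) \cdot \Phi(y), \qquad \Phi(x) = \left(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\right).
\end{aligned}
$$

Evaluating the left-hand side never requires forming $\Phi(x)$ or $\Phi(y)$.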

As this is a dot product, its output is just a real number, so it's in $\mathbb{R}$. As such, I think $K$ is written as this (I'm sure if I'm wrong someone will correct me):

$K: \mathbb{R}^{N_{data}} \times \mathbb{R}^{N_{data}} \rightarrow \mathbb{R}$

So the kernel function maps from pairs of samples in $\mathbb{R}^{N_{data}}$ to $\mathbb{R}$. That's a mapping to a lower dimensional space. However, it does this in a way which implicitly maps each sample to the much higher dimensional $\mathbb{R}^{N_{kernel}}$ first using $\Phi$, and then finds the dot product between the samples in this new space. The latter is the mapping Andrew Ng is talking about.
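
To make the two mappings concrete, here is a minimal Python sketch, assuming a polynomial kernel $K(x, y) = (1 + xy)^d$ on one-dimensional data (so $N_{data} = 1$). The function names `poly_kernel`, `phi` and `dot` are made up for this illustration. It checks numerically that $K$ takes a pair of samples to a single real number, while the corresponding $\Phi$ sends each sample from $\mathbb{R}^1$ to $\mathbb{R}^{d+1}$.

```python
import math

def poly_kernel(x, y, d):
    """Kernel K: R x R -> R, computed without building any feature vectors."""
    return (1.0 + x * y) ** d

def phi(x, d):
    """Explicit feature map Phi: R^1 -> R^(d+1) for the kernel above.

    By the binomial theorem, (1 + x*y)^d = sum_k C(d, k) * x^k * y^k,
    so the k-th coordinate of Phi(x) is sqrt(C(d, k)) * x^k.
    """
    return [math.sqrt(math.comb(d, k)) * x ** k for k in range(d + 1)]

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, y, d = 0.5, -1.5, 4

implicit = poly_kernel(x, y, d)         # one multiplication and one power
explicit = dot(phi(x, d), phi(y, d))    # map to R^5 first, then take the dot product

print(len(phi(x, d)))                   # dimension of the feature space, d + 1
print(math.isclose(implicit, explicit)) # both routes give the same real number
```

Both routes agree up to floating-point error, but only the explicit one ever constructs a vector in the higher-dimensional space. For the radial kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available.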

So, what would you call the function $\Phi$? Is it called a kernel? — mynameisJEFF, Jun 02 '14 at 10:43
Nope. $K$ is the kernel. $\Phi$ doesn't have a specific name I'm aware of. Bishop calls it a 'fixed feature-space transform'. — Pat, Jun 02 '14 at 10:50
@mynameisJEFF: Here you can find a more 'intuitive' explanation that $\Phi$ can map onto a space of dimension equal to the size of the training sample, even for an infinite training sample: http://stats.stackexchange.com/questions/80398/svm-in%EF%AC%81nite-dimensional-feature-space/168309#168309 — , Aug 22 '15 at 08:42
