
I apologize for the verbose description, but (after searching several places for an answer) maybe the best way to phrase it is to lay things out explicitly.

Say we are trying to build an SVM model for classification, and our data consists of $m$=1000 training examples with $n$=15 features.

We start with the 'Primal' version of the optimization problem, which is defined in terms of the values of the feature-vectors $x_i$.

The Lagrangian dual of the optimization problem is (from Wikipedia):

${\text{maximize:}}\ \ \ \sum_{i=1}^{m} c_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_{i} y_{j} c_{i} c_{j}\,(x_{i}\cdot x_{j})\ ,$

${\text{subject to:}}\ \ \ \sum_{i=1}^{m} c_{i} y_{i} = 0, \ \text{and } 0 \leq c_{i} \leq \frac{1}{2m\lambda}\ \text{for all } i.$

Notably, this is defined only in terms of the pairwise dot-products of the $x_i$ vectors. This lets us use the kernel trick: we replace $(x_i \cdot x_j)$ in the optimization objective with $K(x_i, x_j)$, where $K$ is our choice of kernel function. Let's say we choose the RBF: $K(u,v)=\exp \left(-{\frac {\|u -v \|^2}{2\sigma ^2}}\right)$, and this brings us to:


APPROACH (A)
${\text{maximize}}:\ \ \ \sum_{i=1}^{m} c_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_{i} y_{j} c_{i} c_{j}\, K(x_{i}, x_{j})$
This approach works as if we had implicitly mapped all vectors $x_i \in \mathbb{R}^{15}$ to some higher-dimensional feature set $\phi_i$ (in this case, $\phi_i \in \mathbb{R}^\infty$, because we use the RBF), and found an optimal hyperplane to separate the points in that space.
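As a concrete (purely illustrative) sketch of Approach (A), one could hand the raw 15-dimensional data to an off-the-shelf SVM solver and let it apply the RBF kernel implicitly; the data, labels, and the choice $\gamma = 1/(2\sigma^2) = 0.1$ below are assumptions made just for the example, not part of the original question.

```python
# Minimal sketch of Approach (A): the solver applies the RBF kernel
# implicitly via the kernel trick. Data and hyperparameters are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))        # m = 1000 examples, n = 15 features
y = np.sign(X[:, 0] + 0.5 * X[:, 1])   # toy labels in {-1, +1}

# scikit-learn's gamma corresponds to 1 / (2 * sigma**2) in the RBF above
clf_a = SVC(kernel="rbf", gamma=0.1, C=1.0)
clf_a.fit(X, y)
print(clf_a.n_support_)                # number of support vectors per class
```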


Now consider APPROACH (B), where we use the RBF in a different way:
We map each $x_i$ to $\mathbb{R}^{m}$ by evaluating, for each point, its 'nearness' to every training point using the RBF.
That is, each $x_i \in \mathbb{R}^{15}$ is mapped to a $\phi_i \in \mathbb{R}^{1000}$,
where the value of the $d^{th}$ dimension of $\phi_i$ is $\text{RBF}(x_i, x_d)$.
This is not very different from constructing polynomial features and then running linear regression.
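A minimal sketch of Approach (B), reusing the same illustrative data and $\gamma$ as in the previous snippet: build the $m$ explicit RBF features for every point and fit an ordinary linear SVM on them (the use of `rbf_kernel` and `LinearSVC` here is an assumption for the example, not the only way to do it).

```python
# Minimal sketch of Approach (B): explicit RBF features against the training
# set, followed by a plain linear classifier. Data and gamma are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

Phi = rbf_kernel(X, X, gamma=0.1)      # Phi[i, d] = RBF(x_i, x_d), shape (1000, 1000)
clf_b = LinearSVC(C=1.0)
clf_b.fit(Phi, y)

# A new point must also be mapped against the same 1000 training points:
x_new = rng.normal(size=(1, 15))
phi_new = rbf_kernel(x_new, X, gamma=0.1)
print(clf_b.predict(phi_new))
```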


In both approaches we are using the RBF as a measure of 'similarity'.
My questions are:

  • Is Approach (B) used in practice? Is there anything wrong with using the RBF to compute new features in this way, or is it somehow inferior to Approach (A)?

  • Intuitively, how is the handling of the pairwise 'similarity' of training points different in the two approaches? Why do 'similarity-functions' work well as choices for kernel functions in the first place?

dhrumeel
  • I think you will find something similar here : http://stats.stackexchange.com/questions/168051/kernel-svm-i-want-an-intuitive-understanding-of-mapping-to-a-higher-dimensional/168082#168082 –  Jan 29 '17 at 09:20
  • That thread has some very interesting discussion on kernel methods. However, it doesn't quite touch on the differences between the specific approaches I mention here. Specifically, approach B, in which we map our feature set explicitly using the RBF (as opposed to A, in which we **use** the RBF but are implicitly mapping to some unknown set of features) – dhrumeel Jan 30 '17 at 17:46
  • Well, the $\phi$ in the link is defined in the same way as you define it, and the thread shows that both approaches are equivalent (which is exactly the idea behind reproducing kernel Hilbert spaces)? So Approach B does the same as A, except that it makes the $\phi$ explicit? –  Jan 30 '17 at 19:44
  • It does not seem like the two approaches are equivalent. In (A), the mapped-to feature space is (implicitly) $\mathbb{R}^\infty$, while in (B) we are explicitly defining features that lie in $\mathbb{R}^{1000}$. – dhrumeel Jan 30 '17 at 23:49
  • Well, in B you define a feature space that is $\mathbb{R}^n$ where $n$ is your sample size; as this can be done for samples of any size $n$, this is $\mathbb{R}^\infty$. –  Jan 31 '17 at 04:51
  • But according to [this answer](http://stats.stackexchange.com/a/69767/146836), the implicit feature-map for the RBF kernel is quite different from the pairwise-RBF features of approach (B) – dhrumeel Jan 31 '17 at 21:42
  • Well, the idea behind kernels is that for some kernels there exists a $\phi$ such that $K(x,y)=\langle \phi(x), \phi(y)\rangle$. The fact that this holds has been proven by mathematicians in the theory of reproducing kernel Hilbert spaces. If you study how they prove it, you will see that they show the existence of such a $\phi$ by constructing it. More: you will see that they construct it in the same way as in your Approach B. –  Feb 01 '17 at 04:42
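(For reference, the RKHS construction alluded to in the last comment can be summarized in one line; here $\mathcal{H}$ denotes the reproducing kernel Hilbert space induced by $K$, and the notation is added for illustration rather than taken from the thread.)

$$\phi(x) = K(\cdot, x), \qquad \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = K(x, y).$$

Evaluating $\phi(x_i) = K(\cdot, x_i)$ at the 1000 training points gives exactly the vector $\phi_i \in \mathbb{R}^{1000}$ of Approach (B), although the inner product used in $\mathcal{H}$ is not the ordinary dot product on those 1000 coordinates.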

0 Answers