In short: I am currently reading Online Learning with Kernels (http://books.nips.cc/papers/files/nips14/AA33.pdf) for fun and I can't figure out how the authors get to equation 8 from equations 6 and 7.
The idea is: we want to minimize a risk functional $R_{\mathrm{stoch}}[f,t] := c(x_t, y_t, f(x_t)) + \lambda\Omega[f]$. If we apply the representer theorem to $f$, writing it as $f(x) = \sum_i \alpha_i k(x, x_i)$, how do we arrive at the STOCHASTIC gradient descent update? Say we take the soft margin loss for SVMs. It would be easy to take the gradient of the regularized risk w.r.t. $f$ (well, a subgradient for the loss term) and do ordinary gradient descent; a sketch of that is below. But for online learning with stochastic gradient descent, I'm kinda lost.
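To show where I'm stuck, here is a minimal sketch (my own notation, not the paper's) of the batch version I do understand: subgradient descent on the regularized soft margin risk, with $f$ expanded through the representer theorem over the training points. The RBF kernel, step size `eta`, and regularization `lam` are just illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian RBF kernel matrix k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def batch_subgradient_descent(X, y, lam=0.1, eta=0.1, epochs=100):
    """Batch (not online) subgradient descent on
    (1/m) * sum_t max(0, 1 - y_t f(x_t)) + (lam/2) * ||f||_H^2,
    with f(x) = sum_i alpha_i k(x, x_i)."""
    m = len(y)
    K = rbf_kernel(X, X)              # Gram matrix over the training set
    alpha = np.zeros(m)               # coefficients of the kernel expansion
    for _ in range(epochs):
        f = K @ alpha                 # f(x_t) for all training points
        margins = y * f
        # subgradient of the hinge loss w.r.t. f(x_t): -y_t where margin < 1
        g = np.where(margins < 1.0, -y, 0.0) / m
        # gradient of (lam/2) * alpha^T K alpha is lam * K @ alpha
        alpha -= eta * (K @ (g + lam * alpha))
    return alpha
```

What I can't see is how this turns into the per-example update of equation 8, where only one $(x_t, y_t)$ is seen at a time and the expansion grows with $t$.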
Thank you! Please do not hesitate to ask for further details. Any help would be greatly appreciated.