
It is easier to optimise an SVM without an offset term $b$, since we get rid of one of the linear constraints in the dual formulation. If an offset is still needed, it has been suggested that the feature vector $x\in\mathbf{R}^n$ be augmented by a constant feature. Let's call this constant $d$; then $x' = [x; d]\in\mathbf{R}^{n+1}$.
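For concreteness, here is a minimal sketch of the trick (the toy data and the use of scikit-learn's `LinearSVC` with `fit_intercept=False` are just my illustration, not something prescribed by the references below):

```python
# Minimal sketch: append a constant feature d so a no-offset SVM can still learn a bias.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # toy data with n = 2 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # labels that genuinely need an offset

d = 1.0                                           # value of the constant feature
X_aug = np.hstack([X, np.full((len(X), 1), d)])   # x' = [x; d]

# fit_intercept=False: no explicit offset b; the weight on the constant
# feature, multiplied by d, plays the role of the bias.
clf = LinearSVC(fit_intercept=False, C=1.0, max_iter=10000).fit(X_aug, y)
w, v = clf.coef_[0, :2], clf.coef_[0, 2]
print("implicit bias b = v * d =", v * d)
```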

Suykens et al. 2014 (page 9 of *Regularization, Optimization, Kernels, and Support Vector Machines*) write:

> As a side-effect, the offset $d^2$ is then also regularized in the new term $||w||^2$. Nevertheless, if desired, the effect of this additional regularization can be made arbitrarily weak by re-scaling the fixed additional feature value from one to a larger value.
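To spell out the quoted argument (my own notation, assuming the augmented weight vector is $w' = [w; v]$ with $v$ playing the role of $b/d$; the book states the conclusion without this notation):

$$ w'^\top x' = w^\top x + v\,d = w^\top x + b, \qquad ||w'||^2 = ||w||^2 + v^2 = ||w||^2 + \frac{b^2}{d^2} $$

So for a fixed effective bias $b$, the extra penalty $b^2/d^2$ can be made arbitrarily small by increasing $d$.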

A similar argument was made here by Sobi in a different thread.

However, if we increase $d$ (to make the regularization of the offset weaker), we make its effect on the kernel matrix larger. E.g. for a linear kernel and another augmented sample $y' = [y; d]$ we get

$$ K(x',y') = x'^\top y' = x^\top y + d^2 = K(x,y) + d^2 $$
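Numerically the shift is easy to see (again just a toy sketch with NumPy):

```python
# Sketch: with augmented features the linear Gram matrix is the original one
# shifted by the constant d**2 in every entry.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
d = 10.0

X_aug = np.hstack([X, np.full((len(X), 1), d)])

K = X @ X.T               # K(x, y) = x^T y
K_aug = X_aug @ X_aug.T   # K(x', y') = x^T y + d^2

print(np.allclose(K_aug, K + d**2))  # True: every entry is shifted by d^2
```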

Is this effect on the kernel matrix problematic? Should we therefore avoid making $d$ too large?

  • My own question [Should we account for the intercept term when kernelizing algorithms?](https://stats.stackexchange.com/questions/232562/should-we-account-for-the-intercept-term-when-kernelizing-algorithms) might shed some light onto this. – Firebug Oct 23 '17 at 18:16
  • It does partially, though my question is a bit different: I know I want to include a bias term, but I want to use the 'trick' of including it in $x$ to ease optimisation. So does this 'trick' lead me astray when I choose a large $d$, or is it ok? – appletree Oct 23 '17 at 19:21
  • If $d$ is really large, its coefficient will be small (it's a tradeoff), therefore it won't be severely penalized. The converse is also true: if you make $d$ small, its associated coefficient will be huge, and more prone to penalization. – Firebug Oct 25 '17 at 15:47
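To check Firebug's point numerically, a small experiment along these lines could be run (my own sketch with illustrative data; the choice of `LinearSVC` as the no-offset solver is an assumption):

```python
# Sketch: the weight on the constant feature shrinks roughly like 1/d,
# while the implicit bias d * coef stays in the same ballpark.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

for d in [0.1, 1.0, 10.0, 100.0]:
    X_aug = np.hstack([X, np.full((len(X), 1), d)])
    clf = LinearSVC(fit_intercept=False, C=1.0, max_iter=10000).fit(X_aug, y)
    coef = clf.coef_[0, -1]
    print(f"d={d:6.1f}  coef on constant feature={coef:+.4f}  implicit bias={coef * d:+.4f}")
```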

0 Answers