
On slide 32/85 of this tutorial on structured SVM learning, the author formulates binary SVM classification in terms of structured output SVM by defining $\Psi(x, y) = \frac{y}{2}x$.

Why is there a constant factor of $1/2$ in the feature map and inference? I can't see it making any difference to the result.

jogloran

1 Answer

The $1/2$ factor is there so that the loss functions match. Below I will write out the binary and structured SVM training objectives and show how setting $\Psi(x, y) = \frac{y}{2}x$ makes them equivalent.

Standard (binary) SVM formulation

(@ slide 16/85 of the tutorial) $$ E(w) = \frac{\lambda}{2} ||w||^2 + \frac{1}{n} \sum_{i=1}^n \max \left\{ 0, 1 - y_i w^T x_i \right\} \tag{1} $$
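For concreteness, here is a minimal NumPy sketch of objective $(1)$; the function name `binary_svm_objective` and the variable names are my own, not from the tutorial:

```python
import numpy as np

def binary_svm_objective(w, X, y, lam):
    """Objective (1): L2 regularizer plus average hinge loss.
    X: (n, d) inputs, y: (n,) labels in {+1, -1}, lam: the lambda weight."""
    margins = y * (X @ w)                   # y_i * w^T x_i for each example
    hinge = np.maximum(0.0, 1.0 - margins)  # max{0, 1 - y_i w^T x_i}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```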

Structured SVM formulation

(@ slides 42 & 44/85 of the tutorial) $$ E(w) = \frac{\lambda}{2} ||w||^2 + \frac{1}{n} \sum_{i=1}^n \max_{y \in \mathcal{Y}} \left( \Delta(y_i, y) + w^T \Psi(x_i, y) - w^T \Psi(x_i, y_i) \right) \tag{2} $$ where $\Delta(y_i, y) = \mathbf{1}\{ y \neq y_i \}$.
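Objective $(2)$ can be written down just as directly. In the sketch below (again my own naming, not the tutorial's), `Psi` and `Delta` are passed in as functions and the inner maximum is taken by enumerating the finite label set `Y`:

```python
import numpy as np

def structured_svm_objective(w, X, y, lam, Psi, Delta, Y):
    """Objective (2): L2 regularizer plus average structured hinge loss.
    Psi(x, y) returns a feature vector; Delta(y_true, y) is the task loss;
    Y is the (finite) output set to maximize over."""
    total = 0.0
    for i in range(X.shape[0]):
        # max over y in Y of: Delta(y_i, y) + w^T Psi(x_i, y) - w^T Psi(x_i, y_i)
        total += max(
            Delta(y[i], yy) + w @ Psi(X[i], yy) - w @ Psi(X[i], y[i])
            for yy in Y
        )
    return 0.5 * lam * np.dot(w, w) + total / X.shape[0]
```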

Now let's see what happens when we set $\Psi(x, y) = \frac{y}{2}x$, as suggested on slide $32$. Note that in this case $\mathcal{Y}=\{+1, -1\}$, so the term inside the summation in Equation $(2)$ is the maximum of the following two quantities: $$ \begin{cases} 0 + w^T \left(\frac{y}{2} x_i\right) - w^T \left(\frac{y_i}{2} x_i\right) = 0 & \text{ when } y = y_i \\ 1 + w^T \left(\frac{y}{2} x_i\right) - w^T \left(\frac{y_i}{2} x_i\right) = 1 - y_i w^T x_i & \text{ when } y \neq y_i \end{cases} $$ where the second equality uses the fact that $y \neq y_i$ forces $y = -y_i$, so that $\frac{y - y_i}{2} = -y_i$. The summand is therefore $\max \left\{ 0, 1 - y_i w^T x_i \right\}$, which is exactly the hinge loss in $(1)$. Without the $1/2$, i.e. with $\Psi(x, y) = yx$, the same calculation would give $\max \left\{ 0, 1 - 2 y_i w^T x_i \right\}$, which is not the standard hinge loss.
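You can also check the equivalence numerically. Using the two sketches above with $\Psi(x, y) = \frac{y}{2}x$ and the 0/1 task loss, both objectives return the same value on random data:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.choice([-1, 1], size=20)
w = rng.normal(size=3)

Psi = lambda x, yy: 0.5 * yy * x        # Psi(x, y) = (y/2) x, as on slide 32
Delta = lambda yt, yy: float(yt != yy)  # 0/1 loss

print(binary_svm_objective(w, X, y, lam=0.1))
print(structured_svm_objective(w, X, y, 0.1, Psi, Delta, Y=[-1, 1]))
# both lines print the same number
```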

Sobi