
I've seen the following definition of the hinge loss for multiclass classification, which uses a delta term.

$$ L(W) = \frac{1}{N} \sum_{i=1}^{N} L_{i}(W) + \frac{\lambda}{2} \|W\|^2 $$

$$ L_{i}(W) = \sum_{j \neq y_i} \max\left(0, \Delta + w_j \cdot \vec{x_i} - w_{y_i} \cdot \vec{x_i}\right) $$

As I understand it, this can be read as trying to ensure that the score for the correct class is higher than the scores of all other classes by at least some margin $\Delta > 0$.
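For concreteness, here is a minimal NumPy sketch of that loss (the function name, the $(C, D)$ weight layout, and the vectorization are my own choices, not taken from any particular library):

```python
import numpy as np

def multiclass_hinge_loss(W, X, y, delta=1.0, lam=0.0):
    """Multiclass hinge loss as written above (a sketch, not a library API).

    W     : (C, D) weight matrix, one row w_j per class
    X     : (N, D) data matrix, one row x_i per example
    y     : (N,)   integer class labels
    delta : the margin term Delta
    lam   : the regularization constant lambda
    """
    N = X.shape[0]
    scores = X @ W.T                        # (N, C); entry [i, j] is w_j . x_i
    correct = scores[np.arange(N), y]       # (N,);   w_{y_i} . x_i
    margins = np.maximum(0.0, delta + scores - correct[:, None])
    margins[np.arange(N), y] = 0.0          # the sum skips j == y_i
    data_loss = margins.sum(axis=1).mean()  # (1/N) * sum_i L_i(W)
    reg_loss = 0.5 * lam * np.sum(W * W)    # (lambda/2) * ||W||^2
    return data_loss + reg_loss
```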

My question is: does $\Delta$ matter?

My intuition is that the bigger $\Delta$ is, the harder it becomes for the classifier to find a good separation of the space: observations that would otherwise already be ignored start contributing to the loss, and training takes longer. I trained an SVM on the MNIST data a few times with larger and larger deltas, and the test accuracy kept going down as the loss went up.
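Here is a rough sketch of that kind of sweep (on synthetic blobs rather than MNIST, so it runs without downloads; the training loop, learning rate, and epoch count are arbitrary choices of mine, with $\lambda$ held fixed while $\Delta$ grows):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MNIST: 5 classes, 20 features.
X, y = make_blobs(n_samples=2000, centers=5, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
C, D = 5, X.shape[1]

def train(delta, lam=1e-3, lr=1e-3, epochs=200):
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(C, D))
    N = X_tr.shape[0]
    for _ in range(epochs):
        scores = X_tr @ W.T
        correct = scores[np.arange(N), y_tr]
        margins = np.maximum(0.0, delta + scores - correct[:, None])
        margins[np.arange(N), y_tr] = 0.0
        # Subgradient of the hinge term: +x_i for each violating class j,
        # -(number of violations) * x_i for the correct class y_i.
        mask = (margins > 0).astype(float)
        mask[np.arange(N), y_tr] = -mask.sum(axis=1)
        grad = mask.T @ X_tr / N + lam * W
        W -= lr * grad
    return W

for delta in [0.1, 1.0, 10.0, 100.0]:
    W = train(delta)
    acc = (np.argmax(X_te @ W.T, axis=1) == y_te).mean()
    print(f"delta={delta:7.1f}  test accuracy={acc:.3f}")
```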

Yet I usually see $\Delta$ simply set to 1, and no one actually runs a hyperparameter search for it. I wonder if that's because it's somehow related to the regularization constant $\lambda$. If so, can someone explain the connection?

Maverick Meerkat

1 Answer


I think they are connected by the fact that $\Delta$ is the margin. In the binary case we want a margin of $\frac{2\Delta}{\|w\|}$, so increasing $\Delta$ has the same effect as decreasing $\|w\|$, which is exactly what increasing the regularization parameter $\lambda$ encourages.
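One way to make that precise (my own reparametrization, offered as a sketch): write $W = \Delta \widetilde{W}$, and the margin can be folded into the regularization constant.

$$ L_i(\Delta \widetilde{W}) = \sum_{j \neq y_i} \max\left(0, \Delta + \Delta \widetilde{w}_j \cdot \vec{x_i} - \Delta \widetilde{w}_{y_i} \cdot \vec{x_i}\right) = \Delta \sum_{j \neq y_i} \max\left(0, 1 + \widetilde{w}_j \cdot \vec{x_i} - \widetilde{w}_{y_i} \cdot \vec{x_i}\right) $$

$$ \frac{\lambda}{2} \|\Delta \widetilde{W}\|^2 = \Delta^2 \, \frac{\lambda}{2} \|\widetilde{W}\|^2 \quad\Longrightarrow\quad L(\Delta \widetilde{W}) = \Delta \left[ \frac{1}{N} \sum_{i=1}^{N} \widetilde{L}_i(\widetilde{W}) + \frac{\Delta \lambda}{2} \|\widetilde{W}\|^2 \right] $$

where $\widetilde{L}_i$ is the same hinge loss with $\Delta = 1$. Minimizing with margin $\Delta$ and regularization $\lambda$ is therefore the same problem (up to the overall factor $\Delta$, which doesn't change the minimizer) as minimizing with margin $1$ and regularization $\Delta\lambda$. That trade-off is one reason $\Delta$ can be fixed at $1$ and only $\lambda$ tuned.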

Maverick Meerkat