
A typical soft-margin support vector classifier solves the following optimization problem:

$$\min ~ \frac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^N \zeta_i \\ \text{s.t.} ~~~ y_i(w^Tx_i+b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0$$

This hinge-loss setup also slightly penalizes correctly classified data points that fall within the margin. If we modify the constraint slightly, the result is a learning machine with a regularized perceptron loss, which penalizes only misclassified data points:

$$y_i(w^Tx_i+b) \geq - \zeta_i$$
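
For concreteness, here is a minimal sketch (assuming NumPy; the helper names are mine) of the per-point penalties implied by the two constraint sets: the hinge loss $\max(0,\, 1 - y_i(w^Tx_i+b))$ and the perceptron loss $\max(0,\, -y_i(w^Tx_i+b))$.

```python
import numpy as np

def hinge_loss(y, f):
    """SVM penalty max(0, 1 - y*f): misclassified points AND correctly
    classified points inside the margin (0 < y*f < 1) are penalized."""
    return np.maximum(0.0, 1.0 - y * f)

def perceptron_loss(y, f):
    """Penalty implied by the modified constraint, max(0, -y*f):
    only misclassified points (y*f < 0) are penalized."""
    return np.maximum(0.0, -y * f)

y = np.array([+1, +1, -1, -1])       # labels
f = np.array([0.5, 2.0, -0.2, 0.1])  # decision values w^T x + b
print(hinge_loss(y, f))              # [0.5 0.  0.8 1.1]
print(perceptron_loss(y, f))         # [0.  0.  0.  0.1]
```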


I understand there are historical reasons things are the way they are (the margin-maximization rhetoric, etc.), but is there a particular theoretical reason for not implementing a support vector classifier in this manner? What would be the pros and cons?

Cagdas Ozgenc

1 Answer


Maximizing the margin is not just "rhetoric". It is the essential feature of support vector machines and ensures that the trained classifier has the optimal generalization properties. More precisely, making the margin large maximizes the probability that the classification error on new data will be small. The theory behind it is called the Vapnik-Chervonenkis (VC) Theory.

In your question, you consider a soft-margin classifier, but the reasoning is equally valid for hard-margin classifiers, which work only on linearly separable datasets. In that case, all your $\zeta_i$'s would simply turn out to be $0$, thus minimizing the sum in your objective function. Therefore, for simplicity, I'll reformulate your argument for linearly separable data.

Training a support vector machine amounts to optimizing:

$$\min ~ \lVert w \rVert ^2 \\ \text{s.t.} ~ ~ ~ y_i (w^T x_i + b) \ge 1$$

We want to minimize $\lVert w \rVert ^2$ because that maximizes the margin $\gamma$:

$$\gamma = \frac{1}{\lVert w \rVert}$$

The constraints ensure not only that all points from the training set are correctly classified, but also that the margin is maximized. As long as there are points for which $y_i (w^T x_i + b) \lt 1$, the training continues by adjusting $w$ and $b$.
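As a sanity check, here is a minimal sketch (assuming scikit-learn and NumPy; the toy data and names are mine) that approximates the hard-margin case with a very large $C$ and verifies that the closest points satisfy $y_i(w^Tx_i+b) \approx 1$ while the geometric margin equals $1/\lVert w \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian clouds, labelled -1/+1 (linearly separable toy data).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3.0, rng.randn(20, 2) + 3.0])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

functional_margins = y * (X @ w + b)   # y_i (w^T x_i + b)
print(functional_margins.min())        # ~ 1: the closest points lie on the margin
print(1.0 / np.linalg.norm(w))         # geometric margin gamma = 1 / ||w||
```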

Now, you suggest that we can use different constraints:

$$\min ~ \lVert w \rVert ^2 \\ \text{s.t.} ~ ~ ~ y_i (w^T x_i + b) \ge 0$$

The solution to this problem is trivial: Simply set $w=0$ and $b=0$, and $\lVert w \rVert ^2$ will be zero, too. However, this doesn't give you any information about the class boundary: $w$, being now a null vector, doesn't define any hyperplane at all!
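A small numerical illustration of this collapse (a sketch assuming SciPy and NumPy; the setup is hypothetical): minimizing $\lVert w \rVert^2$ under the relaxed constraints drives $w$ toward the null vector.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3.0, rng.randn(20, 2) + 3.0])
y = np.array([-1.0] * 20 + [+1.0] * 20)

def objective(p):                 # p = [w1, w2, b]
    return p[0] ** 2 + p[1] ** 2  # ||w||^2; b is not penalized

# SLSQP 'ineq' constraints require fun(p) >= 0, i.e. y_i (w^T x_i + b) >= 0.
constraints = [{"type": "ineq",
                "fun": lambda p, xi=xi, yi=yi: yi * (xi @ p[:2] + p[2])}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=[1.0, 1.0, 0.0], method="SLSQP", constraints=constraints)
print(res.x)  # approximately [0, 0, 0]: no usable hyperplane
```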

Or, to have a look from a different perspective:

Imagine having performed the training using some different learning algorithm, let's say, the perceptron algorithm. You have found some $w$ and $b$ which result in perfect classification on your training set. In other words, your constraint $y_i (w^T x_i + b) \ge 0$ is satisfied for every $i$. But, is this class boundary realistic?

Take the following example: The class boundary separates the blue from the red points, but almost touches one in each class (i.e. the "perceptron" condition is satisfied, but not the SVM condition):

[figure: suboptimal separation]

In contrast, the one below has a "large margin", satisfying the SVM condition:

[figure: large margin separation]

It is intuitively clear that this classifier, having the boundary as far as possible from the training points, has a better chance of good generalization.
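
To make the comparison concrete, here is a hedged sketch (assuming scikit-learn; the same kind of toy data as above) that fits both a perceptron and a linear SVM on linearly separable data and compares the distance from the boundary to the closest training point. Both separate the data perfectly, but the SVM's gap is the largest achievable.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3.0, rng.randn(20, 2) + 3.0])
y = np.array([-1] * 20 + [+1] * 20)

def distance_to_closest_point(w, b):
    """Geometric distance from the hyperplane w^T x + b = 0 to the nearest point."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

perc = Perceptron(random_state=0).fit(X, y)   # satisfies only y_i(w^T x_i + b) >= 0
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # maximizes the margin

print(distance_to_closest_point(perc.coef_.ravel(), perc.intercept_[0]))
print(distance_to_closest_point(svm.coef_.ravel(), svm.intercept_[0]))
# The SVM's distance is the largest achievable; the perceptron's is typically smaller.
```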

Igor F.
  • Your reasoning about the optimization is valid. That max margin is better than another boundary is, in my mind, dubious, no matter how visually pleasing or intuitive it is. Can you provide a reference showing that max margin is actually a minimax boundary under all probability distributions for the two-class scenario? – Cagdas Ozgenc Jan 27 '20 at 18:45
  • See for example Nello Cristianini & John Shawe-Taylor: "Support Vector Machines", Cambridge University Press, 2000, p. 63 (quote of the theorem, originally, I believe, by Vapnik). – Igor F. Jan 27 '20 at 19:30
  • Also, if you take a look at the figures, maybe you can recognize that both lines (and infinitely many more) satisfy the "perceptron" condition: They correctly classify all points. But, are they all equally good class boundaries? – Igor F. Jan 27 '20 at 19:47
  • 1
    I am under the impression that you are "moving the goalposts". Originally, you wanted to know whether there is a theoretical reason for not using the "perceptron" loss function. I answered that. For everything else, I suggest you post a new, separate question. – Igor F. Jan 27 '20 at 20:29
  • You answered the original question; I am not challenging your answer. Since you evangelized the max margin, I thought maybe you know a proof for a minimax bound. For example, two Gaussians with equal diagonal covariances placed far apart from each other will be estimated quite poorly by max margin compared to simply taking the midpoint of the means. For this reason I wanted to understand at least whether there is a minimax defense for it. – Cagdas Ozgenc Jan 27 '20 at 20:34
  • The VC-Theory gives you upper bounds on errors independent of the probability distribution behind the data. Of course, if you know the distribution, you might achieve a lower error using a more specific algorithm. E.g. if you have two Gaussian classes of the same size and variance, you are probably better off with logistic regression. But, in practice, you seldom know the distribution and, even if you did, the assumptions for an alternative known algorithm (in our case: two classes, equal size, equal variances) would probably be violated. – Igor F. Jan 28 '20 at 08:38
  • You did not show how the last two figures follow from the different inequality constraints involving $\zeta_i$. – Germania Oct 20 '21 at 21:40
  • Your margins are wrong. – Germania Oct 20 '21 at 23:04
  • @Germania You probably mean the objective function, but no, it's not wrong. People use both versions, even in the same publication (see e.g. Vapnik, "The Nature of Statistical Learning Theory", (5.9) vs. (5.10)), and Shalev-Shwartz and Ben-David in "Understanding Machine Learning" formulate it in terms of a general parameter $\lambda$ (15.6). You need to dive a bit into the math to understand why. – Igor F. Oct 21 '21 at 06:51
  • margin=2/||w|| is correct. – Germania Oct 21 '21 at 07:45
  • 1
    @Germania I see what you mean, but, again, it's a matter of definition. Cristianini and Shawe-Taylor in "An Introduction to Support Vector Machines...", p. 95 define the margin as 1/2 the distance between the closest vectors of the different classes, or the distance between the boundary and these vectors. Vapnik's Fig. 5.2 suggests that he uses the same convention, as does Bishop, PRML, Fig. 7.1. But I acknowledge that some researchers use the double value and that it may look more intuitive. For eventual computations in SVM it makes no difference. – Igor F. Oct 21 '21 at 18:37