
In this scikit-learn example, they demonstrate how to plot the separating hyperplane of an SVM. Whilst I have some understanding of this, I will ask about a few basic points:

  1. What are the scattered dots?
  2. Why do we add a +1/-1 margin to the line?
  3. Are the dots on the dashed lines important?
  4. Why is this useful? How is it to be interpreted?

For reference, here is the graph from the link:

[plot from the scikit-learn example: scattered points of two classes, a solid separating line, and two dashed margin lines]

turnip
    A lot of this will be clear automatically if you understand what an SVM is & how it works. You may want to read this: [How does a Support Vector Machine (SVM) work?](https://stats.stackexchange.com/q/23391/) – gung - Reinstate Monica Feb 16 '18 at 15:57
  • I had a read through and now I have answers to all points apart from 4). If I have a trained SVM, do I gain anything by making a similar plot for either my training, testing or new data? – turnip Feb 16 '18 at 17:09
  • #4 is fine, but it amounts to why people would ever want to plot their data, or their data w/ their model overlaid. An answer to that question in general would presumably cover this instance as well. – gung - Reinstate Monica Feb 16 '18 at 18:23

1 Answer


The SVM is a transformation-based classifier. It transforms your data into a space where it can find a hyperplane that best separates examples (instances) from the different classes.

In your graph, each point represents an example. The points are scattered according to the values of their features in the space used by the SVM (which can simply be the space of the original data, as in this example).

The solid line is the optimal separating hyperplane, the dashed lines are the margins, and the points lying on the dashed lines are the support vectors. They are all related (a short code sketch reproducing the plot follows this list):

  • the hyperplane is the one that separates the classes with the largest possible margin; maximizing that margin is the objective the SVM optimizes;
  • the margins are equally spaced on either side of the hyperplane. They are called +1 and -1 because the decision function $f(x) = w \cdot x + b$ evaluates to exactly +1 or -1 for the instances sitting on them;
  • the support vectors are the "hardest" instances of your problem: they are the ones closest to the hyperplane, lying exactly on the margins, and they alone determine the solution. You cannot find a hyperplane with a wider margin without having some of them end up inside it.
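To make this concrete, here is a minimal sketch in the spirit of the scikit-learn example you linked (the toy data from make_blobs is made up here, so the picture will not be identical): the solid contour is the hyperplane, the dashed contours are the ±1 margins, and the circled points are the support vectors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs

# Toy, linearly separable data (made up for this sketch)
X, y = make_blobs(n_samples=40, centers=2, random_state=6)

# Linear SVM; a large C makes it behave (almost) like a hard-margin SVM
clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)

# The scattered dots: one point per training example, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# Evaluate the decision function on a grid to draw the hyperplane and margins
ax = plt.gca()
xlim, ylim = ax.get_xlim(), ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
grid = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(grid).reshape(XX.shape)

# Solid line: decision function = 0 (the hyperplane)
# Dashed lines: decision function = -1 and +1 (the margins)
ax.contour(XX, YY, Z, colors="k", levels=[-1, 0, 1],
           linestyles=["--", "-", "--"])

# Circle the support vectors: the dots sitting on the dashed lines
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=100, facecolors="none", edgecolors="k")
plt.show()
```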

When you want to classify new data, the only instances you need are the support vectors. Suppose you want to classify a new instance whose features are $x=8$ and $y=-8$. The SVM only needs to evaluate the kernel (a similarity) between this example and the support vectors, and from that it knows on which side of the hyperplane the example falls. Our instance $(8,-8)$ falls on the side of the orange instances, so the SVM will assign it the "orange" class.
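Continuing the sketch above (so `clf` and the made-up data are reused; whether $(8,-8)$ lands on one side or the other depends on that data), you can check that the prediction relies only on the support vectors:

```python
# Classify a hypothetical new point (8, -8)
x_new = np.array([[8.0, -8.0]])
print(clf.predict(x_new))            # the class label for the new instance
print(clf.decision_function(x_new))  # its sign says which side of the hyperplane we are on

# The same score rebuilt from the support vectors alone:
# f(x) = sum_i dual_coef_i * K(sv_i, x) + intercept
K = clf.support_vectors_ @ x_new.T           # linear kernel: plain dot products
manual = clf.dual_coef_ @ K + clf.intercept_
print(manual)                                # matches decision_function(x_new)
```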

Notice that the SVM never explicitly computes the transformed feature values. Instead, it uses a function called a kernel, which gives the inner product (a similarity) between instances in the transformed feature space without actually transforming them; the transformation is implicit. This makes it possible for the SVM to work in very complex, even infinite-dimensional, spaces.
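As a sketch of this, again continuing the same session (the gamma value is arbitrary here): with an RBF kernel the decision function is still just a weighted sum of kernel evaluations against the support vectors, and the implicit transformation is never computed.

```python
from sklearn.metrics.pairwise import rbf_kernel

# Same toy data, but with an RBF kernel (gamma chosen arbitrarily for the sketch)
rbf_clf = svm.SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# The decision function is only kernel evaluations against the support vectors;
# the (infinite-dimensional) transformed features are never built explicitly.
K = rbf_kernel(x_new, rbf_clf.support_vectors_, gamma=0.5)  # shape (1, n_support_vectors)
manual = K @ rbf_clf.dual_coef_.T + rbf_clf.intercept_
print(manual, rbf_clf.decision_function(x_new))             # the two values agree
```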

Also, this looks like an example of a hard-margin SVM. The hard-margin SVM is obtained by solving an optimization problem in which all instances from one class must fall on one side of the margin and all instances of the other class on the opposite side. This is a very strict constraint, and in practice we use the soft-margin SVM, whose cost function tolerates a few instances on the "wrong side" of the margin. This reduces the variance of the model and in turn makes it generalize better to unseen data.
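In scikit-learn this trade-off is exposed through the C parameter of SVC. A rough sketch, reusing the toy data from above (the C values are arbitrary):

```python
# Soft margin in scikit-learn: C controls how costly margin violations are.
# Small C -> wider margin, more points tolerated on the wrong side;
# large C -> behaviour close to the hard margin shown above.
soft_clf = svm.SVC(kernel="linear", C=0.05).fit(X, y)
hard_clf = svm.SVC(kernel="linear", C=1000).fit(X, y)
print(len(soft_clf.support_vectors_), len(hard_clf.support_vectors_))
# typically more support vectors with the small C, because more points end up
# on or inside the (wider) margin
```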

giusti