In my project I want to build a logistic regression model that predicts a binary outcome (1 or 0).
I have 15 variables, 2 of which are categorical, while the rest are a mixture of continuous and discrete variables.
In order to fit a logistic regression model, I have been advised to first check for linear separability using an SVM, a perceptron, or linear programming. This ties in with suggestions made here regarding testing for linear separability.
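To make the question concrete, here is my understanding of the linear-programming version of that check, as a sketch with scipy.optimize.linprog (X and y are placeholders for my own feature matrix and labels; please correct me if the formulation is off):

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Return True if a hyperplane separates the two classes.

    Looks for w, b with y_i * (w @ x_i + b) >= 1 for every sample;
    such a pair exists if and only if the classes are linearly
    separable. y must be coded as +1 / -1.
    """
    n, d = X.shape
    # Decision variables are [w_1, ..., w_d, b]; we only care about
    # feasibility, so the objective is all zeros.
    c = np.zeros(d + 1)
    # Rewrite y_i * (w @ x_i + b) >= 1 as -y_i * [x_i, 1] @ [w, b] <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # 0 = feasible/optimal, 2 = infeasible

# With labels coded {0, 1}: is_linearly_separable(X, 2 * y - 1)
```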
As a newcomer to machine learning I understand the basic concepts behind the algorithms mentioned above, but conceptually I struggle to visualise how we can separate data with so many dimensions, i.e. 15 in my case.
All the examples in online material typically show a 2D plot of two numerical variables (height, weight) with a clear gap between the categories, which makes them easy to understand, but real-world data usually has a much higher dimension. I keep being drawn back to the Iris dataset, trying to fit a hyperplane through the three species, and how it is particularly difficult, if not impossible, to do so between two of them (versicolor and virginica, I believe).
How does one achieve this when we have even more dimensions? Is it assumed that, once we exceed a certain number of features, we must use kernels to map the data to a higher-dimensional space in order to achieve this separability?
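For what it's worth, here is the toy example I put together to try to convince myself of the feature-map idea (entirely made-up numbers): a single feature whose classes interleave along the line, so no threshold separates them in 1D, yet mapping x to (x, x²) makes them separable in 2D:

```python
import numpy as np

# One feature; class 1 sits on both sides of class 0, so no single
# threshold (a "hyperplane" in 1D) can separate the classes.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# The feature map phi(x) = (x, x**2) lifts the data into 2D, where the
# horizontal line x2 = 1 now separates the classes perfectly.
phi = np.column_stack([x, x ** 2])
print(phi[y == 1, 1])  # second coordinate: all > 1
print(phi[y == 0, 1])  # second coordinate: all < 1
```

My understanding is that a kernel does this kind of lifting implicitly rather than computing phi explicitly, but I am not sure whether that is needed here or only for data that is not separable in the original feature space.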
Also, what metric is used to test for linear separability? Is it the accuracy of the SVM model, i.e. the accuracy based on the confusion matrix?
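My current guess, which I would like confirmed or corrected: the relevant metric is training accuracy rather than held-out accuracy, since the question is whether some hyperplane separates the data I have, not how well the model generalises. Something like this sketch with scikit-learn, where X and y stand in for my full data set and C=1e6 is an arbitrary large value to approximate a hard margin:

```python
from sklearn.svm import SVC

# Fit an (approximately) hard-margin linear SVM by making C very large,
# then check whether it classifies the *training* data perfectly.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)
train_acc = clf.score(X, y)  # accuracy on the same data we fit on
print("Linearly separable?", train_acc == 1.0)
```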
Any help in better understanding this topic would be greatly appreciated. Below is a plot of two variables from my dataset, which shows how much just these two overlap.