
I was reading the Hands-On ML book and I'm on the SVM and logistic regression chapters. I started looking up more about these algorithms, and apparently they are "linear" classifiers, i.e. the decision boundary is linear (the classifier needs the inputs to be linearly separable).

Now, the book mentions that since in most cases the data is not linearly separable, we have to increase the dimensionality of the features to make it linearly separable.

But is it always true that there is some transformation that converts every non-linearly separable data set into a linearly separable one? If not, what would be an example of a data set for which this is impossible?
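
For concreteness, here is a small sketch of the kind of dimension-adding transformation I understand the book to mean (using scikit-learn and made-up toy data): adding an $x^2$ feature to one-dimensional data that no single threshold can separate.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1D data: class 1 sits between two groups of class 0,
# so no single threshold on x separates the classes.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0],   # class 0
              [-0.5], [0.0], [0.5]])                          # class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

# Lift to 2D by adding x^2 as a second feature.
X_lifted = np.hstack([X, X ** 2])

# A linear SVM now separates the training data perfectly
# (a line in (x, x^2) space corresponds to a curve in the original space).
clf = LinearSVC(C=1000, max_iter=10000).fit(X_lifted, y)
print(clf.score(X_lifted, y))  # expect 1.0 on this toy data
```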

Aastha Jha
  • You can always make a dataset linearly separable in a trivial (and unhelpful) way: associate a unique real number to each cluster in the dataset and adjoin that number as a new coordinate to all observations. There are other methods that don't require you to know the classifications at the outset, but they typically require many more dimensions. – whuber Jul 30 '20 at 15:26
  • Some useful discussion in the context of SVMs: https://stats.stackexchange.com/questions/94003/does-using-a-kernel-function-make-the-data-linearly-separable-if-so-why-using?r=SearchResults&s=2|73.7604 – Sycorax Jul 31 '20 at 17:32

2 Answers


In theory, it is always possible to make an arbitrary dataset linearly separable in higher dimensions. In fact, you only need to add one additional dimension to do so: a dimension that represents your true class labels. No matter what the data looks like in the other dimensions, if you have a way to add a dimension that encodes the true class values, you can linearly separate on that dimension alone and perfectly recover the true classes. The only case where it's impossible to add such a dimension is when you have two identical samples with different classes, since there is then no deterministic way to map them to different classes given only the feature data.

Otherwise, this mapping is always possible in theory, but in practice it's usually difficult to come up with a way to generate that extra dimension of class labels that is generalizable and not overfit. A simple transformation is to look at all your data points and just assign the true class as the value on the new dimension, but this method completely fails to generalize to points not in the original data. It's trivial to overfit a mapping that linearly separates the training data, but much more difficult to find one that will accurately separate data you didn't train on.
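
A rough sketch of that trivial transformation (the toy data and helper names are purely illustrative): the lifted coordinate is just a lookup of the known training label, so the training set becomes separable by thresholding that coordinate, but the lookup carries no information about unseen points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two interleaved classes in 2D that are not linearly separable.
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)  # XOR-like labels

# "Memorization" mapping: the new coordinate is simply the known label.
lookup = {tuple(x): label for x, label in zip(X_train, y_train)}

def lift(x):
    # Returns (x1, x2, label) for points we have seen, and an arbitrary
    # value (here 0) for points we have not -- this is where it breaks down.
    return np.append(x, lookup.get(tuple(x), 0))

X_lifted = np.array([lift(x) for x in X_train])

# Thresholding the third coordinate at 1/2 separates the training set,
# but the lifted coordinate says nothing about new points.
print(np.all((X_lifted[:, 2] > 0.5) == (y_train == 1)))  # True
```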

Nuclear Hoagie
  • @Jason_93 What I've described is basically the whole point of supervised machine learning - to find a mapping from a set of input features to a target feature. If you can find a prediction output that has an AUC of 1 (it perfectly predicts the target), you have effectively discovered an additional dimension which makes the dataset linearly separable. – Nuclear Hoagie Aug 01 '20 at 03:57
  • Hello, I didn't understand what you meant by "make any arbitrary dataset linearly separable...". From what I've seen before, we talk about linear separability of two sets of data lying in a certain dimension. But here you're just talking about one dataset, so I asked. – Mathmath Nov 18 '20 at 13:44
  • @Mathmath I'm talking about any possible (arbitrary) dataset that contains 2 or more classes - the number or type of features doesn't matter. In this original dimensional space, you may or may not be able to linearly separate the classes. But there will always exist a mapping where you can add just one more dimension which will make the classes linearly separable. This is basically the goal of many supervised machine learning methods - take a set of features and produce an output (i.e. a new feature) that discriminates between the classes. – Nuclear Hoagie Nov 18 '20 at 14:38
  • Thanks! Can you be a bit more mathematically specific about "...there will always exist a mapping where you can add just one more dimension... (which will make the classes linearly separable)"? Say we have a data set in $\mathbb{R}^d$ consisting of two classes $C_1, C_2 \subset \mathbb{R}^d$. Assume that $C_1, C_2$ aren't linearly separable in $\mathbb{R}^d$. Are you saying that there then exists a map $F: \mathbb{R}^d \to \mathbb{R}^{d+1}$ so that even though the two classes weren't linearly separable in $\mathbb{R}^d$, after taking $F$-images the new $(d+1)$-dimensional sets $F(C_1), F(C_2) \subset \mathbb{R}^{d+1}$ will be linearly separable? Could you cite the theorem? – Mathmath Nov 18 '20 at 14:45
  • @Mathmath Not sure of the theorem, but the trivial approach is to build a non-generalizable model that simply "memorizes" the data. So long as you don't have overlapping points of different classes (same feature values, different classes), each combination of feature values maps uniquely to one class value. It is always possible (although not always practical) to look at feature values and return the class of that sample, which is effectively deriving a new feature (which is simply the class) from the original feature space. – Nuclear Hoagie Nov 18 '20 at 14:57
  • Well, there has to be an explicit construction that is mathematically sound and not 'engineering' in nature. Can you give me a link where such a map is explicitly constructed, purely using math or stats, with no handwaving or programming? – Mathmath Nov 18 '20 at 15:02
  • @Mathmath Cover's Theorem deals with the notion of separability in high-dimensional space; this is also related to non-linear kernel functions, which can be used to map non-linearly separable data into a higher-dimensional space where it is linearly separable. – Nuclear Hoagie Nov 18 '20 at 15:33
  • Thanks - yes, I've seen people mention Cover's theorem in comments elsewhere, but nowhere do I see the exact mathematical statement. If you can point me to the exact theorem with a link, that'd be nice! – Mathmath Nov 18 '20 at 19:30

Yes, you can always linearly separate disjoint subsets of a finite-dimensional space by adding one dimension.

Proposition: If $X_0$ and $X_1$ are disjoint subsets of $\mathbb{R}^n$, then there exists a function $f:\mathbb{R}^n\rightarrow\mathbb{R}^{n+1}$ such that $f(X_0)$ and $f(X_1)$ are linearly separable.

Proof: Define $f$ as follows:

$f(x) = (x, 0), $ for $x\in X_0$,

$f(x) = (x, 1), $ for $x\in X_1$,

and $f(x) = (x, y)$, where $y$ is arbitrary for $x$ elsewhere, e.g., $y$ is a realization of a Bernoulli random variable (coin toss) taking values $0$ or $1$.

Recall the definition of linear separability of disjoint subsets $A$ and $B$ in vector space $U$ with inner product $\langle \cdot,\cdot\rangle$: there exists vector $v\in U$ and real $t$ such that $\langle a,v\rangle < t$ and $\langle b,v\rangle > t$ for all $a \in A$ and $b \in B$.

Let $A=f(X_0)$, $B=f(X_1)$, $v=(\vec{0},1)$ and $t=1/2$ in the above definition:

$$ \langle (x,0), (\vec{0},1) \rangle = 0 < 1/2 \qquad \forall (x,0) \in f(X_0)$$

and

$$ \langle (x,1), (\vec{0},1) \rangle = 1 > 1/2 \qquad \forall (x,1) \in f(X_1).$$

Therefore, the hyperplane $\{ (x, 1/2) : x \in \mathbb{R}^n\} \subset \mathbb{R}^{n+1}$ linearly separates $f(X_0)$ and $f(X_1)$. QED
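
For a concrete (purely illustrative) check of this construction, take two classes on the real line that are not linearly separable in $\mathbb{R}^1$, apply $f$, and verify the inner products against $v=(\vec{0},1)$ and $t=1/2$:

```python
import numpy as np

# Two classes on the real line, not linearly separable in R^1:
X0 = np.array([-2.0, -1.0, 1.0, 2.0])   # class 0 surrounds class 1
X1 = np.array([-0.5, 0.5])              # class 1

# The map f: append coordinate 0 to points of X0 and 1 to points of X1.
f_X0 = np.column_stack([X0, np.zeros_like(X0)])
f_X1 = np.column_stack([X1, np.ones_like(X1)])

# v = (0, 1) and t = 1/2 as in the proof.
v, t = np.array([0.0, 1.0]), 0.5
print(np.all(f_X0 @ v < t), np.all(f_X1 @ v > t))  # True True
```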

Unfortunately, this specific construction will not generalize well, as @Nuclear Hoagie pointed out.