I have often seen the statement that linear separability is more easily achieved in high dimensions, but I don't see why. Is it an empirical fact? A heuristic? Plain nonsense?
- I should have specified that I'm *not* talking about kernel methods but about data that has a lot of features. – cpa Jul 31 '12 at 22:38
- Linked to Dikran's answer [here](https://stats.stackexchange.com/questions/17711/why-does-ridge-regression-classifier-work-quite-well-for-text-classification) – Zhubarb Sep 28 '17 at 09:38
3 Answers
Trivially, if you have $N$ data points in general position, they will be linearly separable in $N-1$ dimensions: any labelling of $N$ affinely independent points can be realized by a hyperplane. Any structure in the data may further reduce the dimensionality required for linear separation. You might say that a data set (or a projection of it) either is or is not completely linearly separable; achieving separation in any dimensionality lower than $N-1$ therefore requires either additional properties of the data or of the projection into the higher-dimensional space, or can be viewed as a heuristic (as in the case of random projections, for instance). In general we usually do not care too much about exact separability, in which case it is sufficient that we can correctly separate more data points in higher dimensions.
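As a quick sanity check (my own sketch, not part of the answer, assuming numpy and scikit-learn are available), the snippet below gives $N$ random points in $N-1$ dimensions an arbitrary binary labelling and fits a perceptron, which converges on linearly separable data:

```python
# Sketch: N random points in N-1 dimensions are in general position with
# probability 1, so ANY assignment of binary labels is linearly separable.
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
N = 20
X = rng.normal(size=(N, N - 1))                 # N random points in N-1 dimensions
y = rng.permutation(np.repeat([0, 1], N // 2))  # arbitrary binary labels

# The perceptron converges on linearly separable data, so training
# accuracy should be exactly 1.0 here.
clf = Perceptron(tol=None, max_iter=10_000).fit(X, y)
print("training accuracy:", clf.score(X, y))    # expect 1.0
```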

- Ok, I thought there'd be a more combinatorial argument but that's ok for me! Thanks! – cpa Aug 02 '12 at 08:44
- Hi, I'm not sure I understand your answer: when you say "if you have $N$ data points, they will be linearly separable in...", what do you mean? I'm used to asking whether two sets of data in the same dimension are linearly separable or not, but you didn't use the phrase "two sets of $N-1$ dimensional data", and this is what I'm not following. Thanks for clarifying! – Mathmath Nov 18 '20 at 13:29
I'm not sure it matters whether the data actually has a high dimensionality or whether the data is projected into a higher dimension. In the latter case, it is true that it's easier to linearly separate something projected into a higher dimension; that is the whole idea of kernel methods. (See Cover's Theorem, etc.)
My typical example is a bullseye-shaped data set, where you have two-dimensional data with one class totally surrounded by another. Not linearly separable in 2 dimensions, but project it into 3 dimensions, with the third dimension being the point's distance from the center, and it's linearly separable.
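A minimal sketch of that bullseye construction (my own code, not the answerer's; it assumes scikit-learn, with `make_circles` standing in for the bullseye data):

```python
# Two concentric classes are not linearly separable in 2-D, but adding the
# distance from the center as a third feature makes them separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X2, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift to 3-D: the added coordinate is each point's distance from the center.
r = np.linalg.norm(X2, axis=1, keepdims=True)
X3 = np.hstack([X2, r])

print("2-D accuracy:", LinearSVC(max_iter=100_000).fit(X2, y).score(X2, y))  # around 0.5
print("3-D accuracy:", LinearSVC(max_iter=100_000).fit(X3, y).score(X3, y))  # close to 1.0
```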

I think what you might be asking about is the use of kernels to make a data set more compatible with linear techniques. A short piece about this is available [here](http://ldtopology.wordpress.com/2012/05/27/making-linear-data-algorithms-less-linear-kernels/).
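For illustration (my sketch, not code from the linked post), here is the same idea with an implicit feature space: an SVM with an RBF kernel applies a linear separator in a high-dimensional space without ever constructing that space explicitly:

```python
# A linear SVM fails on concentric classes in 2-D, while a kernel SVM
# separates them by working linearly in an implicit feature space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # linear boundary in the original 2-D space
rbf = SVC(kernel="rbf").fit(X, y)        # linear boundary in an implicit feature space

print("linear kernel:", linear.score(X, y))  # poor, around 0.5
print("RBF kernel:", rbf.score(X, y))        # near 1.0
```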
