
Consider a binary classification problem on $\mathcal Z=\mathcal X \times \mathcal Y$, where $\mathcal X$ is some subset of $\mathbb R^d$ and $\mathcal Y = \{-1,1\}$. You have a dataset of $n$ samples $(x_i,y_i)\in\mathcal X \times \mathcal Y$ drawn i.i.d. from some probability distribution on $\mathcal Z$.

Now, say you train a classifier $\hat f$ on this dataset (it could be an SVM, a neural network, naive Bayes... anything really). In this setting, one would define the margin $\hat \gamma$ of the classifier $\hat f$ as $$\hat \gamma := \min_{1\le i\le n} y_i\ \hat f(x_i).$$
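
To make the definition concrete, here is a minimal sketch of how one could compute $\hat\gamma$ for a trained classifier. The choice of logistic regression (using its pre-sigmoid score as $\hat f$) and the synthetic data are purely illustrative assumptions; any real-valued $\hat f$ would be handled the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1, -1)  # labels in {-1, +1}

# hat f(x) = <w, x> + b, the classifier's real-valued score
clf = LogisticRegression().fit(X, y)
f_hat = clf.decision_function(X)

# hat gamma = min_i y_i * hat f(x_i)
gamma_hat = np.min(y * f_hat)
print(f"training margin: {gamma_hat:.3f}")
```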

My question is the following: is there any relationship between the margin $\mathbf{\hat \gamma}$ and the dimension of the problem $\mathbf d$? What if the problem is infinite-dimensional (e.g. functional data classification)?

Intuitively, due to the curse of dimensionality, one would expect the points to become "more separated" as the dimension increases, and thus a large margin on the training sample should become progressively easier to achieve, perhaps under some assumptions on the underlying distribution of the data points (e.g. the Tsybakov noise condition).
I have, however, not been able to find any results of that sort in the literature, and I'm not sure how to prove it myself.
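
To make that intuition concrete, here is a quick simulation sketch (an illustration, not a proof) under assumptions I am choosing myself: isotropic Gaussian features, labels drawn independently of the features, and a nearly hard-margin linear SVM (large $C$). Since the canonical functional margin of a hard-margin SVM equals $1$ by construction, the scale-free quantity reported is the geometric margin $1/\lVert w\rVert$, i.e. $\hat\gamma$ after rescaling $\hat f$ to a unit-norm weight vector.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 30  # fixed sample size
for d in [50, 100, 500, 1000, 5000]:
    X = rng.normal(size=(n, d)) / np.sqrt(d)      # normalize so ||x_i|| ~ 1
    y = rng.choice([-1, 1], size=n)               # labels carry no signal at all
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # approximately hard-margin SVM
    w = clf.coef_.ravel()
    print(f"d = {d:5d}   geometric margin = {1.0 / np.linalg.norm(w):.3f}")
```

Since $n \le d+1$ points in general position are linearly separable for any labeling, the margin is positive for every $d$ in the loop; the interesting part is how it evolves as $d$ keeps growing while $n$ stays fixed.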

Any mention of a paper/book/resource that discusses or tackles a similar question, or ideas on how to approach it would be very much appreciated.


Update: here is my attempt at being more specific about what I'm looking for, please bear with me.
It is known (or at least observed in practice) that, for any given class of classifiers $\mathcal C_d$, as the dimensionality $d$ increases, the classifiers in $\mathcal C_d$ are progressively able to achieve better results on the training dataset. In the limit case $d=\infty$, it has even been shown that zero training error is achievable under minimal assumptions (Delaigle and Hall (2012)). Although this fact can be intuitively understood as a consequence of the curse of dimensionality, the theoretical results supporting it are fairly scarce.

I would thus like to quantify how much "easier" it gets to classify training data as the dimension increases: the classification margin seems to be one of the most appropriate metrics for this, as it effectively measures how well-separated the classifier manages to make the points. On top of that, the margin has been quite extensively studied from a theoretical standpoint. In particular, it has been shown that having a margin as large as possible is desirable (Pappu and Pardalos (2014), Vapnik and Chapelle (2000), Cai, Zhang and Peng (2009)...).

So far, the only references I have been able to find that investigate the relationship between margin and dimensionality deal with the asymptotics of the overparametrized regime, i.e. when $$n\to\infty,\ \ d\to\infty,\ \ n/d\to\psi\in\mathbb R,$$ and they seem to support the intuition that learning to separate the training set gets easier as the dimension increases (Huang (2019), Montanari et al. (2019)). I would be interested in results/discussions for the non-overparametrized, non-asymptotic case, i.e. for fixed $n$ and $d$, or for fixed $n$ and $d\to\infty$.
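
For contrast with the fixed-$n$ case, here is a rough sketch of what the proportional regime above looks like empirically. The isotropic Gaussian design, the noiseless linear labels and the choice $\psi = 1/2$ are assumptions of mine for illustration, not the models analyzed in the cited papers.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
psi = 0.5                                          # n / d is held fixed
for d in [100, 200, 400, 800]:
    n = int(psi * d)
    w_star = np.ones(d) / np.sqrt(d)               # unit-norm "true" direction
    X = rng.normal(size=(n, d)) / np.sqrt(d)       # normalize so ||x_i|| ~ 1
    y = np.where(X @ w_star > 0, 1, -1)            # noiseless linear labels
    clf = SVC(kernel="linear", C=1e6).fit(X, y)    # approximately hard-margin SVM
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"n = {n:4d}, d = {d:4d}   geometric margin = {margin:.4f}")
```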

Lastly, to the best of my knowledge, SVMs (which are margin-based) seem to be among the best-performing algorithms in high-dimensional settings, i.e. they are less prone to overfitting than other models. Any reference that investigates this issue from a theoretical perspective or provides key related results would also be appreciated.
