Why does machine learning work for high-dimensional data ($n \ll p$)?

Question

Consider high-dimensional data in which the number of features $p$ is much larger than the number of observations $n$, and suppose a machine learning algorithm is trained on this data.

My first thought is that a learning algorithm trained on such high-dimensional data would have large model variance and therefore poor prediction accuracy.

To construct a model, we need to determine its parameters, and the number of parameters grows as the number of features increases. With wide data, we would not have enough observations to determine all the parameters reliably. I think the model parameters would be sensitive to changes in the training samples. This instability of the parameters indicates a large model variance, which would worsen prediction performance.

However, I have read that machine learning models trained on high-dimensional data can make good predictions. I am curious about the underlying reason why ML works for prediction on high-dimensional data ($n \ll p$).

Some ML algos can indeed be used to fit models on wide data. Whether these models are good is another question. — Michael M, Jun 12 '20 at 07:11
"Wide" data is not the best name for describing p > n, since ["wide" and "narrow", or "tall"](https://en.wikipedia.org/wiki/Wide_and_narrow_data) is already used to name the format in which the data is *stored*. — Tim, Jun 12 '20 at 08:07
@Tim Thanks, I didn't know about the table types. Could you suggest a name instead of the 'wide'? — hbadger19042, Jun 12 '20 at 08:50
@kevin012 I'm not sure if there is a name; people usually just say that it is data where "$n \ll p$" — Tim, Jun 12 '20 at 08:56
I think we need a name for that. I was searching for it a while ago (for LDA), and "n < p" doesn't help on Google — carlo, Jun 12 '20 at 10:31
When you regularize the coefficients (which multiple answers refer to), you are essentially reducing the dimension of the parameter space so the effective degrees of freedom are smaller than the nominal degrees of freedom. — Do not reinstate Monica, Jun 12 '20 at 13:38
@Carlo, say you had a covariate with unique values 1, 2, ..., 1000 and you wanted to fully non-parametrically estimate its relationship with the outcome. That would be a regression problem with $p = 1000$ regression parameters. If you smooth the relationship so that it's linear, you have $p = 2$ parameters. — Do not reinstate Monica, Jun 15 '20 at 18:50
@carlo, look into smoothing splines and effective degrees of freedom — Do not reinstate Monica, Jun 15 '20 at 21:50
Also, @carlo, you define effective degrees of freedom similarly in things like ridge regression (which you can use when the nominal degrees of freedom exceed the sample size) as the trace of the smoother matrix. I don't know what "proper regularization" is though so perhaps I've missed the scope of your comment. — Do not reinstate Monica, Jun 16 '20 at 12:53
mmh, thanks for the pointers. I knew smoothing splines, but not that thing about ridge regression. — carlo, Jun 16 '20 at 19:27

Tim · Accepted Answer · 2020-06-12T13:00:59.497

Most machine learning models use some kind of regularization (see other questions tagged as regularization). In simple words, regularization forces the model to be simpler than it could be. To give a few examples:

- LASSO pushes regression parameters towards zero, in practice removing some of them from the model.
- Dropout randomly turns different parts of a neural network on and off during training, so the network has to learn to work with smaller sub-networks instead of relying on all parameters at once, which makes it more robust.
- With bagging, you train multiple models on different random subsamples of the data, usually also subsampling the columns, and then aggregate them. The individual models in the ensemble have to learn to use different features, and aggregating many models "cancels out" the cases where an individual model overfits.
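
To make the LASSO example concrete, here is a minimal sketch on synthetic $n \ll p$ data, written as plain cyclic coordinate descent in NumPy. The data, the helper names `soft_threshold` and `lasso_cd`, and the choice of penalty are all assumptions made for illustration and are not part of the answer above; in practice one would use a library implementation.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrinks z towards zero by t and returns exactly 0 when |z| <= t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1
    by cyclic coordinate descent, starting from b = 0."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ beta while beta is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that ignores beta[j].
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Synthetic data (an assumption): 20 observations, 200 features,
# of which only the first three actually influence y.
rng = np.random.default_rng(0)
n, p = 20, 200
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

# Any penalty at or above lam_max makes every coefficient exactly zero.
lam_max = np.max(np.abs(X.T @ y)) / n
beta_hat = lasso_cd(X, y, lam=0.3 * lam_max)
print("nonzero coefficients:", np.count_nonzero(beta_hat), "of", p)
```

The soft-thresholding step is what sets coefficients exactly to zero, so the fitted model uses far fewer than $p$ parameters even though $p$ features were supplied.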

Moreover, some recent results show that even without explicit regularization, neural networks and some other models can work well when they have many more parameters than data points, that is, in cases where they could literally memorize the whole dataset and overfit. Apparently that does not happen, and the models seem to regularize themselves, but the mechanism is still not known. This suggests that we do not yet understand well enough why this happens.

carlo · Answer 2 · 2020-06-12T07:48:02.227

3

One word: regularization. The complexity of a model is indeed more or less proportional to the number of predictors (this depends on the model), but ML algorithms use regularization to split the predictive burden among the different predictors, and ultimately yield a cautious outcome.

This works so well that, even when p is small, you can use a kernel method to embed your data in an infinite-dimensional space and, through regularization, effectively learn a generalizable model from there.

You can't apply kernel methods to ordinary linear regression; that would be unstable. But you can apply them to ridge regression, because it includes regularization.
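
As a rough illustration of that last point, here is a minimal NumPy sketch of kernel ridge regression with an RBF kernel. The function names, the toy sine data, and the values of `gamma` and `lam` are assumptions chosen for the example. The fit solves $(K + \lambda I)\alpha = y$; without the $\lambda I$ term one would have to invert the kernel matrix $K$ itself, which is typically very badly conditioned.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared Euclidean distances, clipped at 0 against round-off.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    # Dual coefficients alpha solve (K + lam * I) alpha = y.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    # Prediction is a kernel-weighted sum over the training points.
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy data (an assumption): a noisy sine curve with a single feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1
K = rbf_kernel(X, X)
alpha = kernel_ridge_fit(X, y, lam)
X_new = np.linspace(-3, 3, 5)[:, None]
pred = kernel_ridge_predict(X, alpha, X_new)

# The penalty is what makes the linear system well conditioned.
print("cond(K):          ", np.linalg.cond(K))
print("cond(K + lam * I):", np.linalg.cond(K + lam * np.eye(len(X))))
```

Note that the linear system has one unknown per observation, which is consistent with the remark in the comments that kernel methods scale badly with $n$.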

edited Jun 12 '20 at 07:48

answered Jun 12 '20 at 07:41

carlo


Thanks, carlo. But I'm not quite sure about the kernel methods of solving the problem in the infinite-dimensional space. Could you give me a link to any material on the approach? – hbadger19042 Jun 29 '20 at 12:23
It's more that the kernel trick *creates* the problem of a higher (even infinite) dimensional space. This is why kernel methods always include regularization: it is there to solve the problem of the high (or even infinite) dimensional space. I learned kernel methods from the book by Géron, Hands-On Machine Learning, which explains them very well (but really, there is so much more to it). A good thing about kernel methods is that they scale linearly in *p*, but unfortunately they scale very badly with *n* and are too slow overall. – carlo Jun 30 '20 at 09:11

score 1 · Answer 3 · answered Jun 12 '20 at 13:18

The second paragraph of OP's question describes the phenomenon where regression coefficients cannot be uniquely determined when the design matrix is not full rank.

> To construct a model, we need to decide the parameters of models and the number of parameters is proportional to the number of predictors. And for the wide data, we don't have enough data to decide all the parameters reliably. I guess with the wide data, the parameters of the model will change all the time with the small change of data. There would not be any stable solution for the model. And the instability indicates that there would be large model variance which will worsen the prediction performance.

The answer to this has two parts.

1. Not all machine learning models involve estimating a coefficient vector for a matrix-vector product. For example, random forest can find a best binary split for some data even when $n \ll p$ because finding a split doesn't involve solving a linear system.
2. For machine learning models that do involve a matrix-vector product (e.g. OLS or logistic regression), the addition of a penalty term can make the optimization problem strongly convex, so that the loss function has a unique minimum. See: Why does ridge estimate become better than OLS by adding a constant to the diagonal? Three common examples of penalized regression are ridge regression, lasso regression, and elastic-net regression. This penalty is a form of regularization, because it limits the flexibility of the model.
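
The rank-deficiency point in the second part can be checked numerically. The following is a minimal NumPy sketch on random data; the sizes, the penalty `lam`, and all variable names are assumptions for illustration. It shows that without a penalty two different coefficient vectors fit the training data exactly, that adding $\lambda I$ makes the system uniquely solvable, and it computes the effective degrees of freedom as the trace of the ridge smoother matrix, as mentioned in the comments on the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50  # far fewer observations than features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Without a penalty: X^T X is p x p but has rank at most n, so it is singular.
gram = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(gram), "out of", p)

# Two different coefficient vectors that both reproduce y on the training data.
beta_min_norm = np.linalg.pinv(X) @ y
_, sing_vals, Vt = np.linalg.svd(X, full_matrices=True)
null_dir = Vt[-1]  # right singular vector in the null space of X
beta_other = beta_min_norm + 5.0 * null_dir
print("max fit difference:", np.max(np.abs(X @ beta_min_norm - X @ beta_other)))

# With a ridge penalty: X^T X + lam * I is positive definite,
# so the penalised least-squares problem has exactly one solution.
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

# Effective degrees of freedom: trace of X (X^T X + lam I)^{-1} X^T,
# which equals the sum of d_i^2 / (d_i^2 + lam) over the singular values d_i.
df = np.sum(sing_vals ** 2 / (sing_vals ** 2 + lam))
print("effective degrees of freedom:", df, "with nominal p =", p)
```

Because each term in the sum is strictly below 1 and there are at most $n$ nonzero singular values, the effective degrees of freedom stay below $n$ however large $p$ is.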

The other answers are correct that regularization is why machine learning models can do well in terms of prediction when $n \ll p$, but they don't quite connect that concept to the rank deficiency component of your question.
