How much does regularization prevent overfitting?

I've been curious why machine learning algorithms with high VC dimension, such as XGBoost and deep learning, do well in practice. The answer appears to be that regularization significantly restricts the parameter space, but the only justifications I've seen are references to Occam's razor.

Is there quantitative/theoretical understanding of how much regularization can prevent overfitting in a model?

Background: I've taken a course in machine learning that covered the theory behind regularization and a couple of techniques, such as lasso and ridge regression. I understand the principle that regularization can reduce the effective VC dimension by shrinking or zeroing weights in the model.
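
For concreteness, here is a minimal sketch of that principle, assuming scikit-learn (the data and the penalty strengths `alpha` are arbitrary illustrative choices, not from the course):

```python
# A hypothetical sketch: lasso's L1 penalty drives coefficients exactly to
# zero, while ridge's L2 penalty only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                # only 3 informative features
y = X @ beta + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)         # penalty strengths are arbitrary
ridge = Ridge(alpha=10.0).fit(X, y)

print("zero coefficients (lasso):", int(np.sum(lasso.coef_ == 0)))  # typically most noise features
print("zero coefficients (ridge):", int(np.sum(ridge.coef_ == 0)))  # typically 0
```

The exact zeros from lasso are what the "zeroing weights" part of the principle refers to; ridge's coefficients stay small but nonzero.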

But that principle does not make clear to me whether regularization is enough to counteract the high VC dimension of the models used by XGBoost and deep learning.

I am asking for some quantitative theory justifying that, even with high VC dimension, regularization can reduce the effective capacity enough to provide a decent guarantee of generalization.

Alternatively, providing a method I can use to figure this out myself is also acceptable.

yters

Posted 2017-05-16T02:15:40.080

Reputation: 1,305

What research and reading have you done? There's lots written on regularization and its purpose, including https://en.wikipedia.org/wiki/Regularization_%28mathematics%29 and https://stats.stackexchange.com/q/4961/2921 and many more. There seems to be little point in us repeating standard material. This site works best if you do a significant amount of research before asking, and show us in the question what you've found so far. Why are you unsatisfied with the references to Occam's razor?

– D.W. – 2017-05-16T22:38:37.070

@D.W. Is it even clear that deep learning does regularisation in the sense described in the Wikipedia page you linked? – Martin Berger – 2017-05-17T09:22:52.220

@MartinBerger, yes, in practice deep learning usually does include regularization, in some form or other (dropout, batch normalization, early stopping of gradient descent, etc.). – D.W. – 2017-05-17T15:12:44.223

@D.W., these authors claim that the regularization used when training deep networks does not explain the networks' small generalization error. https://arxiv.org/abs/1611.03530

– yters – 2017-05-18T19:55:08.207

@yters, as far as I can tell, it's an active area of research. There's a later paper responding to that one, https://openreview.net/pdf?id=rJv6ZgHYg, which seems to suggest that regularization might indeed be playing a role in explaining the small generalization error.

– D.W. – 2017-05-18T19:59:54.047

Here is another great post on this topic: http://www.inference.vc/everything-that-works-works-because-its-bayesian-2/

– Nicholas Mancuso – 2017-05-31T23:20:56.803

@NicholasMancuso, great article. If gradient descent is the magic that makes deep learning generalize, then deep learning itself is not special, since gradient descent is applied to many machine learning models. – yters – 2017-06-02T14:33:45.517

@yters, I think it's still too early to say with certainty. But if SGD acting as an approximate Bayes approach is the magic sauce behind DNNs, then why don't we see similar performance from more "sophisticated" Bayesian models that don't require approximations, such as GPs? – Nicholas Mancuso – 2017-06-02T15:40:13.780

Answers

Moritz Hardt's excellent blog outlines extensive research into the idea that the stability of a machine learning algorithm is related to its generalization. Interestingly, regularization implies stability.

Formally, given observations $S = (z_1, \dotsc, z_n)$ drawn i.i.d. from a distribution $D$ and an algorithm $A$, stability controls the expected generalization gap $$\Delta = \mathbb{E}[R - R_e],$$ where $R_e = \frac{1}{n} \sum_{i=1}^n \ell(z_i, A(S))$ is the empirical risk and $R = \mathbb{E}_{z \sim D}[\ell(z, A(S))]$ is the risk.
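
To make $\Delta$ concrete, here is a minimal sketch that estimates the expected generalization gap by Monte Carlo for ridge regression; the Gaussian linear model standing in for $D$ and all the constants are assumptions made purely for illustration:

```python
# Sketch: Monte Carlo estimate of Delta = E[R - R_e] with A = ridge regression.
# The Gaussian linear model playing the role of D is an arbitrary assumption.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d, n, trials = 50, 30, 200
beta = rng.normal(size=d)

def draw(m):
    """Draw m samples z = (x, y) from the assumed distribution D."""
    X = rng.normal(size=(m, d))
    return X, X @ beta + rng.normal(size=m)

def risk(model, X, y):
    return np.mean((model.predict(X) - y) ** 2)   # squared-error loss

gaps = []
for _ in range(trials):
    X, y = draw(n)                                # S ~ D^n
    model = Ridge(alpha=1.0).fit(X, y)            # A(S)
    Xt, yt = draw(10_000)                         # fresh draws approximate R
    gaps.append(risk(model, Xt, yt) - risk(model, X, y))   # R - R_e
print("estimated Delta:", np.mean(gaps))
```

In this toy setup, re-running with a larger `alpha` shrinks the estimated gap (at the cost of more bias), which is the qualitative behavior the stability bounds predict.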

The post details how to obtain bounds on $\Delta$, and their implications, much better than I ever could. I highly recommend reading that post along with all of his other posts.

A much more practical argument is that regularization constrains the solution space to a very specific subset of models; namely, regularization methods can be recast as constrained optimization in which $\|\beta\|_p \leq c$ for some norm $p$ and constant $c$. This is equivalent to using prior knowledge about the problem and forcing the solution to lie within some constrained space.
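
As a sanity check of that recasting, here is a hypothetical sketch for ridge ($p = 2$): solve the penalized form, read off the constraint level $c = \|\hat\beta\|_2$ it implies, and confirm that the constrained form recovers the same solution (the data and $\lambda$ are arbitrary choices):

```python
# Sketch: penalized and constrained ridge regression pick the same coefficients.
# Data and lambda are arbitrary illustrative choices.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

lam = 5.0
# Penalized form: argmin ||y - Xb||^2 + lam * ||b||^2 has a closed form.
b_pen = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
c = np.linalg.norm(b_pen)            # the constraint level this lambda implies

# Constrained form: argmin ||y - Xb||^2 subject to ||b||_2 <= c.
res = minimize(lambda b: np.sum((y - X @ b) ** 2), x0=np.zeros(5),
               constraints=[{"type": "ineq",
                             "fun": lambda b: c - np.linalg.norm(b)}])
print("solutions agree:", np.allclose(res.x, b_pen, atol=1e-4))
```

Lagrangian duality is what makes the recasting work: each penalty weight $\lambda$ corresponds to some constraint level $c$, and vice versa.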

Nicholas Mancuso

Posted 2017-05-16T02:15:40.080

Reputation: 3,797

This is interesting; in particular he states: "Our algorithm's output shouldn't change much if we perturb one of the data points." Yet, in the case of deep learning, we know that it only takes slight perturbations to change the output dramatically. Taking this post at its word, this implies deep learning does not generalize. Is that a fair assessment? If so, then this answer gives me the insight I need and I'll accept it. – yters – 2017-05-18T02:06:17.173

@yters I don't think that is a fair assessment. Stability is about perturbing the training data, not the input to the trained model; in the deep learning setting, it would be equivalent to altering a single picture out of the entire data set and retraining. With that said, that post, along with its links, describes enough of the mathematical framework for you to begin your own formal analysis of deep learning. – Nicholas Mancuso – 2017-05-18T17:15:36.213

True, but wouldn't this robustness to perturbation apply not only to the training process, but also to the final model itself? It seems any significant discontinuity in output from slight perturbations to the input fed to the final model would also mean that slight changes in the training data significantly change the final model. But, as you say, I'll need to conduct a formal analysis myself. – yters – 2017-05-18T17:35:47.697

This comment seems to be saying something similar: http://disq.us/p/18hvn9n

– yters – 2017-05-18T19:42:27.517