I've been curious why machine learning algorithms with high VC dimension, such as XGBoost and deep learning, do well in practice. The answer appears to be regularization significantly restricting the parameter space, but the only justifications I've seen are references to Occam's razor.
Is there quantitative/theoretical understanding of how much regularization can prevent overfitting in a model?
Background: I've taken a course in machine learning that covered the theory behind regularization and a couple of techniques, such as lasso and ridge regression. I understand the principle that regularization can reduce the VC dimension by shrinking or zeroing weights in the model (a small sketch of what I mean is below).
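For concreteness, here is a minimal sketch of that principle, assuming scikit-learn and NumPy are available; the data is synthetic and purely illustrative. The L1 (lasso) penalty drives many coefficients exactly to zero, while the L2 (ridge) penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 100 samples, 50 features,
# only the first 5 of which actually influence the target.
rng = np.random.RandomState(0)
X = rng.randn(100, 50)
true_w = np.zeros(50)
true_w[:5] = 1.0
y = X @ true_w + 0.1 * rng.randn(100)

# L1 regularization (lasso) tends to zero out irrelevant weights entirely;
# L2 regularization (ridge) only shrinks them toward zero.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
```

On a run like this, lasso typically sets many of the irrelevant coefficients exactly to zero while ridge sets none exactly to zero, which is the sense in which L1 regularization restricts the effective parameter space.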
However, that principle does not make it clear to me whether regularization is enough to counteract the high VC dimension of the models used by XGBoost and deep learning.
I am asking for some kind of quantitative theory that justifies the claim that, even with high VC dimension, regularization can reduce the effective capacity enough to give a decent generalization guarantee.
Alternatively, providing a method I can use to figure this out myself is also acceptable.
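To illustrate the kind of quantitative statement I have in mind (these are just the textbook bounds as I understand them, nothing specific to XGBoost or deep nets): the classical VC bound pays for the VC dimension d, whereas norm-based bounds for regularized linear predictors pay only for the weight-norm bound B that the regularizer controls.

```latex
% Classical VC bound: with probability at least 1 - delta over an i.i.d.
% sample of size n, every h in a class of VC dimension d satisfies
\[
  R(h) \;\le\; \widehat{R}(h)
    + O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right).
\]

% Norm-based (Rademacher) bound: for linear predictors with \|w\|_2 <= B
% on inputs with \|x\|_2 <= R, the Rademacher complexity satisfies
\[
  \mathfrak{R}_n\big(\{\, x \mapsto \langle w, x \rangle : \|w\|_2 \le B \,\}\big)
    \;\le\; \frac{B R}{\sqrt{n}},
\]
% which is independent of the number of parameters: the norm constraint
% imposed by regularization, not the parameter count, controls the
% capacity term.
```

What I don't know is whether a bound of the second kind exists for the model classes that XGBoost and deep learning actually use, or how tight it is in practice.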
What research and reading have you done? There's lots written on regularization and its purpose, including https://en.wikipedia.org/wiki/Regularization_%28mathematics%29 and https://stats.stackexchange.com/q/4961/2921 and many more. There seems to be little point in us repeating standard material. This site works best if you do a significant amount of research before asking, and show us in the question what you've found so far. Why are you unsatisfied with the references to Occam's razor? – D.W. – 2017-05-16T22:38:37.070

@D.W. Is it even clear that deep learning does regularisation in the sense described in the Wikipedia page you linked? – Martin Berger – 2017-05-17T09:22:52.220
@MartinBerger, yes, in practice deep learning usually does include regularization, in some form or other (dropout, batch normalization, early stopping of gradient descent, etc.). – D.W. – 2017-05-17T15:12:44.223
@D.W., these authors claim the regularization used during training of deep learning networks does not explain the small generalization error of the networks: https://arxiv.org/abs/1611.03530 – yters – 2017-05-18T19:55:08.207
@yters, as far as I can tell, it's an active area of research. There's a later paper responding to that one, https://openreview.net/pdf?id=rJv6ZgHYg, which seems to suggest that regularization might indeed be playing a role in explaining the small generalization error. – D.W. – 2017-05-18T19:59:54.047
Here is another great post on this topic: http://www.inference.vc/everything-that-works-works-because-its-bayesian-2/ – Nicholas Mancuso – 2017-05-31T23:20:56.803

@NicholasMancuso, great article. If gradient descent is the magic that makes deep learning generalize, then deep learning itself is not special, since gradient descent is applied to many machine learning models. – yters – 2017-06-02T14:33:45.517

@yters, I think it's still too early to say with certainty. But if SGD acting as an approximate Bayes approach is the magic sauce behind DNNs, then why don't we see similar performance from more "sophisticated" Bayesian models that don't require approximations, such as GPs? – Nicholas Mancuso – 2017-06-02T15:40:13.780