I see this expression almost everywhere on the internet: "regularization works because it keeps weights small." So I'm going to make a semi-confident assertion here, which is really more of a question to test my understanding.

This is a misleading statement. Hand-wavingly making the weights smaller won't achieve anything on its own. If you take all the weights of a model that's overfitting and scale them down, the resulting function will be just as overly complex, only scaled down.
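To make that concrete with a toy case of my own: for a model that is linear in its parameters, such as polynomial regression with basis functions $\phi_j$, scaling every weight by the same constant $c$ only rescales the fitted curve and leaves all of its wiggles in place:

$$f_{c\beta}(x) \;=\; \sum_j (c\,\beta_j)\,\phi_j(x) \;=\; c\sum_j \beta_j\,\phi_j(x) \;=\; c\, f_\beta(x).$$

(For a deep network with nonlinearities the effect of a uniform scaling is more complicated, so treat this only as an illustration of the intuition.)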

The real trick of regularization is that you are forcing the training to make choices about which weights it wants to keep large and which weights it should get rid of by pushing them to zero. So regularization imposes a kind of economy, or rationing, of weights on the model. The model has to choose which weights give the most bang for their buck, and get rid of the weights that add a small amount of value but don't contribute to a general fit.
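Here is a small toy sketch of what I mean (my own example, using scikit-learn with made-up data and an arbitrary penalty strength, not something from the quoted sources). It fits a degree-10 polynomial once without regularization and once with an L1 penalty, then prints the coefficients so you can see which weights the penalty keeps and which it pushes to zero:

```python
# Toy sketch: L1 regularization as a "weights economy".
# The unregularized fit spends freely on every coefficient; the lasso fit
# has to ration its coefficient budget and zeroes out the less useful ones.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy target

# Degree-10 polynomial features: plenty of capacity to overfit 30 points.
X = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x[:, None])

ols = LinearRegression().fit(X, y)                     # no penalty
lasso = Lasso(alpha=0.01, max_iter=100_000).fit(X, y)  # L1 penalty (arbitrary alpha)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Nonzero lasso coefficients:", np.count_nonzero(lasso.coef_))
```

If my reading is right, the lasso run should leave only a handful of coefficients nonzero while the unregularized run uses all of them, which is the rationing I'm describing; a ridge ($L_2$) penalty would instead shrink all of them without producing exact zeros.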

So I would rephrase the general expression to: "Regularization works because it keeps a subset of the weights small".

EDIT

I think this answer confirms my thought process.

  • “Keeping the weights small” isn’t such a bad way to describe it. The goal is variance reduction, so if your regularized weights can only be between -2 and 2, there should be less variability than if they can be between -20 and 20 (or really anywhere for unregularized). – Dave Mar 29 '20 at 17:51
  • @Dave Regularization minimizes some target error, hopefully a target not chosen naively, but the usual case is to choose naively or empirically. The connection between error propagation minimization and weights is abstract, so, no, I wouldn't say that. – Carl Mar 30 '20 at 04:00
  • @Dave I thought the variance being reduced is in what the model would resolve to when trained under varying training sets. So if the function that the model produces by learning with different training sets **varies** a lot, then that's the variance people talk about - and it's not a good thing. But if the model function has strong variations in parameter space (a totally different concept) and those are indeed a good representation of the function it's trying to model, then that's fine. – Alexander Soare Mar 30 '20 at 05:40

1 Answer

I don't think it is misleading; it is just a bit short. As regularization is (often) a penalty on some norm of the parameter (in your terminology, weight) vector, like $\| \beta \|^2$ (ridge), it keeps the overall size of the parameters (weights) down. But it does not do so in a blind way: it does so while minimally destroying the other part (negative log likelihood, sum of squares, ...) of the criterion (loss) function. That is consistent with the explanation in your linked post of regularization as restricting the model space.
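To make that concrete for the ridge case with a least-squares loss, the fitted parameter vector minimizes the two parts jointly,

$$\hat{\beta}_\lambda \;=\; \arg\min_\beta \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \right\}, \qquad \lambda \ge 0,$$

so the weights are only shrunk to the extent that the reduction in $\lambda\|\beta\|^2$ does not cost too much in the sum of squares.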

Since the penalization term can be interpreted as a Bayesian prior in the model space, we are reducing the model space, not by cutting some parts of it off (as we would do by omitting some predictor, say), but by introducing a measure on the space, thereby down-weighting parts of it.
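Spelled out for the Gaussian case (the standard correspondence, assuming a prior $\beta \sim N(0, \tau^2 I)$ and likelihood $y \mid \beta \sim N(X\beta, \sigma^2 I)$):

$$-\log p(\beta \mid y) \;=\; \frac{1}{2\sigma^2}\|y - X\beta\|^2 + \frac{1}{2\tau^2}\|\beta\|^2 + \text{const},$$

so the posterior mode is exactly the ridge estimate with $\lambda = \sigma^2/\tau^2$.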

It turns out that we are down-weighting those parts corresponding to large parameter vectors.

  • Good answer. I still think it might be misleading, but that becomes more of an argument about English semantics than anything. – Alexander Soare Oct 25 '21 at 19:03