
I understand that in order to avoid overfitting we need to reduce the complexity of the network, or, in other words, reduce the degree of the polynomial. The L1-norm does exactly this: it reduces the degree of the polynomial by making the solution sparse.

In the L2-norm case the weights don't become zero, only small. Thus the degree of the polynomial does not decrease, meaning the network does not become less complex, meaning the overfitting is not solved. So I don't understand how limiting the weights from growing into big numbers helps to avoid overfitting.

theateist

1 Answer


Reducing the weights limits the range of possibilities the model can achieve.

Imagine a simple linear model like this:

$$Y = b_0 + b_1x_1$$

For example's sake say $Y$ is the height of the person and $x_1$ is the weight of the person.

If we don't limit the range of $b_1$ then the model can assign any value to it. Depending on the data we use to estimate the parameters, the model might decide that with each additional kg of body weight the height of the person increases by 10cm.

By adding a penalty to high values of the weights we prevent the model from choosing extremely high estimates. The model is more constrained: it becomes regularized. This limits overfitting because the range of values the model can now take is limited.
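As a minimal sketch of this effect (the numbers below are made up for illustration, and the L2-penalized solution is computed in closed form on centered data so the intercept can be dropped):

```python
import numpy as np

# Hypothetical data: x = body weight in kg, y = height in cm.
x = np.array([50.0, 55.0, 62.0, 70.0, 80.0, 95.0])
y = np.array([158.0, 163.0, 170.0, 174.0, 179.0, 185.0])

# Center both variables so the model reduces to y = b1 * x (no intercept).
x = x - x.mean()
y = y - y.mean()

# L2-penalized least squares: minimize sum((y - b1*x)^2) + lam * b1^2.
# Closed form for a single feature: b1 = x'y / (x'x + lam).
for lam in [0.0, 100.0, 1000.0, 10000.0]:  # lam = 0 is plain least squares
    b1 = (x @ y) / (x @ x + lam)
    print(f"lambda = {lam:>7.0f}   slope b1 = {b1:.3f}")

# As lambda grows the slope is pulled towards zero: the model is no longer
# free to claim an arbitrarily large "cm of height per kg of weight" effect.
```

The penalty does not pick the "right" slope for you; it just makes extreme slopes expensive, so they are only chosen when the data strongly supports them.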

In turn, when you have a lot of features such a constraint will prevent the model from basing all of its results on a single feature. In the optimization step, when the choice is between putting all the weight on one extremely reliable feature or spreading it across several less reliable ones, the model will use several features, simply because putting a big weight on any single one becomes too costly. This again can reduce overfitting, because a model that uses several features is often more robust.
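The same closed-form machinery gives a hedged sketch of that "spreading across features" point, using made-up data with two nearly identical features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two almost-duplicate (highly correlated) features, target driven by their sum.
n = 50
f1 = rng.normal(size=n)
f2 = f1 + rng.normal(scale=0.01, size=n)          # nearly a copy of f1
y = f1 + f2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([f1, f2])

def l2_fit(X, y, lam):
    """Closed-form L2-penalized least squares: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# With lam = 0 the split between the two correlated features is unstable and
# can be very uneven; with a penalty the weight is shared roughly equally.
print("no penalty :", l2_fit(X, y, lam=0.0))
print("L2 penalty :", l2_fit(X, y, lam=1.0))
```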

Bottom line: by adding regularization the space of possible models you can reach is reduced. And simpler models are less variable, hence less prone to overfitting (but potentially more biased).

The same reasoning should transfer to neural networks.
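In a neural network the same idea usually appears as weight decay. Here is a minimal sketch, assuming PyTorch and made-up data, with the L2 penalty written out explicitly in the loss so the "punishment" term is visible (in practice the optimizer's weight_decay argument does the same job):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up regression data: 100 samples, 5 features.
X = torch.randn(100, 5)
y = X @ torch.randn(5, 1) + 0.1 * torch.randn(100, 1)

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
mse = nn.MSELoss()
lam = 1e-3  # regularization strength

for step in range(200):
    optimizer.zero_grad()
    pred = model(X)
    l2 = sum((p ** 2).sum() for p in model.parameters())  # sum of squared weights
    loss = mse(pred, y) + lam * l2   # data fit + penalty for large weights
    loss.backward()
    optimizer.step()
```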

Karolis Koncevičius
  • Can I say that by limiting the weights we limit how much the function oscillates? For example, if the output function that approximates the data is $\sin$, then by regularizing we basically limit its amplitude? – theateist Jun 18 '18 at 21:45
  • @theateist yes, I think for understanding that's fine. It also shows how regularization is intimately linked to prior information: if you already know that the amplitude cannot be very high, you add this information to the model (via constraints) and you get better estimates. But also keep in mind that these constraints are most often used when you have multiple features. In those cases you put a constraint on the sum of the (squared) weights, and the model distributes the weights it "can afford" among all the features. In a typical scenario the constraint restricts only the total sum of the weights. – Karolis Koncevičius Jun 18 '18 at 21:53
  • In this case, when we have multiple features, meaning higher dimensionality, the output function is not $\sin$ but some kind of hypersurface, and we want to limit how it "oscillates". – theateist Jun 18 '18 at 22:11
  • @theateist yes, you can think about it that way too. Though the way it is often implemented is that you add a cost parameter "lambda" that controls how much the model gets punished. So it becomes a tradeoff between a good fit and having small weights. As a result the function can still oscillate quite strongly, but only in situations where all the other weights give very bad results. – Karolis Koncevičius Jun 18 '18 at 22:17
  • I have a hard time understanding when someone says "the model gets punished" or "the model prefers smaller weights". I don't see it mathematically. I would be very glad if you could explain this mathematically (not in a very difficult way) or provide a link. – theateist Jun 18 '18 at 22:27
  • I see. I thought you had read about the mathematics of the fit and were confused about how it could help with overfitting. Maybe try here: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/ (I just found it with some googling). This site already has many answers on this topic, and I see the question was already closed as a duplicate, so you could try reading the answers in the referenced question as well; a short sketch of the penalized loss is also included below. Good luck. – Karolis Koncevičius Jun 19 '18 at 00:37
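For reference, a minimal sketch of the mathematics behind "the model gets punished", using the single-feature model from the answer and an L2 penalty with cost parameter $\lambda$:

$$\min_{b_0,\, b_1} \; \underbrace{\sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_{i1}\right)^2}_{\text{how well the model fits the data}} \;+\; \underbrace{\lambda\, b_1^2}_{\text{penalty for a large weight}}$$

The larger $b_1$ becomes, the larger the second term, so a big weight is only kept when it improves the fit by more than it costs in penalty; that is the "punishment", and $\lambda$ controls how severe it is.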