27

According to this tutorial on deep learning, weight decay (regularization) is not usually applied to the bias terms b. Why is that?

What is the significance (intuition) behind it?

gung - Reinstate Monica
Harshit
  • I think I have seen a very similar question before, but I just cannot find it... Perhaps you should review the related questions; you might find the answer there. Also, perhaps [this](http://stats.stackexchange.com/questions/153933/importance-of-the-bias-in-neural-networks) could be somewhat useful. – Richard Hardy May 27 '15 at 19:35
  • I don't agree with the responders. One thing weight decay provides is normalization. But when you shrink only the weights and not the bias, you don't merely scale the layer's output, you fundamentally change it, especially given the activation that follows. I'll have to experiment, but I think it's proper to scale down the bias as well. – Íhor Mé Aug 05 '20 at 20:00

5 Answers

28

Overfitting usually requires the output of the model to be sensitive to small changes in the input data (i.e. to exactly interpolate the target values, you tend to need a lot of curvature in the fitted function). The bias parameters don't contribute to the curvature of the model, so there is usually little point in regularising them as well.
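
As a rough numerical illustration of this point (my own sketch, not part of the original answer), consider a single sigmoid unit f(x) = sigmoid(w*x + b): the largest curvature the unit can produce grows with the weight w, while the bias b only shifts where along the input axis that curvature occurs.

```python
import numpy as np

# Sketch: estimate the maximum |f''(x)| of a single sigmoid unit
# f(x) = sigmoid(w*x + b) by central finite differences, for several
# (w, b) pairs. The achievable curvature depends on w, not on b.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_abs_curvature(w, b):
    xs = np.linspace(-20.0, 20.0, 20001)          # dense input grid
    h = xs[1] - xs[0]
    f = sigmoid(w * xs + b)
    f2 = (f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2  # second finite difference
    return np.max(np.abs(f2))

for w in (0.5, 2.0, 8.0):
    for b in (-5.0, 0.0, 5.0):
        print(f"w={w:4.1f}  b={b:5.1f}  max |f''| ~ {max_abs_curvature(w, b):.4f}")
# For each w, the three values of b give (numerically) the same maximum
# curvature; increasing w increases it roughly like w**2.
```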

Dikran Marsupial
7

The motivation behind L2 (or L1) regularization is that by restricting the weights you constrain the network, making it less likely to overfit. It makes little sense to restrict the biases: the bias input is fixed (e.g. a constant 1), so the bias terms act like neuron intercepts, and it makes sense to give them more flexibility.
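
In practice this usually amounts to simply exempting the bias parameters from the penalty. A minimal sketch of that convention, assuming PyTorch (the small model here is only a placeholder):

```python
import torch
import torch.nn as nn

# Sketch: apply weight decay (L2) only to the weight matrices and leave the
# bias vectors unregularized, via PyTorch optimizer parameter groups.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Parameters named "...bias" are exempted from the penalty.
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},    # weights: decayed
        {"params": no_decay, "weight_decay": 0.0},  # biases: free intercepts
    ],
    lr=0.1,
)
```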

Ramalho
2

I would add that the bias term is often initialized with a mean of 1 rather than 0, so we might want to regularize it so that it does not stray too far from a constant value like 1, e.g. by penalizing 1/2*(bias-1)^2 rather than 1/2*(bias)^2.

Perhaps replacing the -1 with a subtraction of the mean of the biases (either a per-layer mean or an overall one) could help, but this is just a hypothesis on my part (about the mean subtraction).

This all depends on the activation function too. E.g. with sigmoids, regularizing the biases toward a high constant offset might be bad because of vanishing gradients.
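
A small sketch of what such a penalty could look like (this only illustrates the hypothesis above, not established practice; the function names are mine):

```python
import numpy as np

# Hypothetical bias regularizers: shrink the biases toward a constant target
# (e.g. 1) or toward their own per-layer mean, instead of toward 0.

def bias_penalty_to_target(b, target=1.0, lam=1e-3):
    """0.5 * lam * sum((b - target)^2): decay toward a constant offset."""
    return 0.5 * lam * np.sum((b - target) ** 2)

def bias_penalty_to_mean(b, lam=1e-3):
    """Same idea, but toward the layer's own mean bias (mean subtraction)."""
    return 0.5 * lam * np.sum((b - np.mean(b)) ** 2)

b = np.array([0.8, 1.1, 1.3, 0.9])   # toy bias vector for one layer
print(bias_penalty_to_target(b), bias_penalty_to_mean(b))
```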

2

Weights determine the slopes of the activation functions. Regularization shrinks the weights and hence reduces those slopes, which reduces the model variance and the overfitting effect. The biases have no influence on the slopes of the activation functions; they influence the position of the activation functions in input space. Their optimal values depend on the weights, so they should be adjusted to the regularized weights, and this adjustment should happen without regularization, since regularizing the biases can be harmful. I considered the roles of weights and biases in randomized NNs, see here.
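
A quick numerical check of the slope/position claim (my own sketch, not from the answer): for a single unit f(x) = sigmoid(w*x + b), the maximum slope is w/4 regardless of b, and b only moves the steepest point to x = -b/w.

```python
import numpy as np

# Sketch: the bias shifts where the sigmoid is steepest, but does not change
# how steep it can get; only the weight controls the maximum slope.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-30.0, 30.0, 60001)
for w, b in [(1.0, 0.0), (1.0, 5.0), (4.0, 0.0), (4.0, 5.0)]:
    f = sigmoid(w * xs + b)
    slope = np.gradient(f, xs)
    i = np.argmax(slope)
    print(f"w={w}, b={b}: max slope ~ {slope[i]:.3f} at x ~ {xs[i]:.2f}")
# The maximum slope depends only on w (it equals w/4); b just shifts its location.
```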

GMD
0

The tutorial says "applying weight decay to the bias units usually makes only a small difference to the final network", so if it does not help, then you can stop doing it to eliminate one hyperparameter. If you think regularizing the offset would help in your setup, then cross-validate it; there's no harm in trying.

Emre