
The author in this video says the following at minute 16:15:

"we don't want to choose big $\lambda$ values because the coefficients will become very small and therefore they might not be accurately reflecting what's going on"

Does he mean that if the coefficients become very small, close to zero, this might reduce the degree of the polynomial so much that the model ends up underfitting?

theateist

4 Answers


The issue is underfitting, yes. If $\lambda$ is too big, the coefficients will be smaller than they ought to be to get the best predictive accuracy. In the most extreme case of an arbitrarily large $\lambda$, all the coefficients are forced to be arbitrarily close to 0 regardless of the data, so the model is maximally underfit.

However, a ridge penalty will rarely change the degree of any polynomials, since it does not force any coefficients to be exactly 0.
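To see this numerically, here is a small sketch (not part of the original answer; it uses scikit-learn on made-up polynomial data, so the exact numbers are only illustrative). Ridge coefficients shrink toward zero as the penalty grows but stay nonzero, while lasso coefficients get set to exactly zero:

```python
# Sketch: compare ridge and lasso coefficients as the penalty grows.
# Synthetic degree-2 signal fit with degree-5 polynomial features.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.1, size=100)
X = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x[:, None])

for lam in [0.01, 1, 100]:
    ridge = Ridge(alpha=lam).fit(X, y)
    lasso = Lasso(alpha=lam, max_iter=10_000).fit(X, y)
    # Ridge shrinks coefficients toward 0 but leaves them nonzero;
    # lasso zeroes them out once the penalty is large enough.
    print(f"lambda={lam}")
    print("  ridge:", np.round(ridge.coef_, 3))
    print("  lasso:", np.round(lasso.coef_, 3))
```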

Kodiologist
  • I know it's unrelated to my question, but is this the main reason we prefer ridge (L2) over lasso (L1): that ridge does not force any coefficients to be exactly 0, and thus reduces the chance of underfitting? – theateist Jun 18 '18 at 19:01
  • @theateist I'm not sure it's the main reason for preferring one over the other, but definitely the fact that the lasso sets a lot of coefficients to 0 is one of the main differences between the two. Sometimes it's a desirable property and sometimes it isn't. – Kodiologist Jun 18 '18 at 19:05
  • @Kodiologist I have a small doubt. If we have a high $\lambda$ value, then multiplying lambda * coefs will increase the coefficients, right? Why would the coefficients go approximately to zero? I was not able to understand the idea behind it. – user_6396 Jul 09 '19 at 03:24
  • @user214 The multiplication in question only occurs in the penalty term. The larger the penalty term, the smaller the coefficient values must be made in order to minimize the objective function. – Kodiologist Jul 09 '19 at 11:42

Ridge regression penalizes "big" values of the coefficients $\beta$, and the degree of this penalization is proportional to $\lambda$.
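For reference (the equation referred to below is not reproduced in the post; this is the standard form of the ridge objective being described):

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \underbrace{\lVert y - X\beta \rVert_2^2}_{\text{fit term}} \;+\; \lambda\, \underbrace{\lVert \beta \rVert_2^2}_{\text{penalty term}}
$$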

On the one hand, you want to minimize the l2-norm of the residuals (this is the first part of the equation). The solution is the least-squares estimator $\hat \beta^{OLS}$.

On the other hand, you want to minimize the l2-norm of the $\beta$s (the penalty term). On its own, this would yield the 0 vector, as you might guess.

$\lambda$ comes in as a compromise between the two. If $\lambda$ is zero, you might be overfitting (or might not even be able to compute a unique solution, if you are in a high-dimensional setting). However, the bigger $\lambda$ is, the more importance you place on the $\beta$s being close to 0, as opposed to the $\beta$s providing a better fit. This means that the bigger the $\lambda$, the worse your fit on the training data. There must therefore be a $\lambda$ that avoids overfitting without leading to a bad fit. You'll learn techniques to choose such a $\lambda$, but there is no definite answer.
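To make the "bigger $\lambda$, worse fit" point concrete, here is a small sketch (not from the answer; it uses synthetic data from scikit-learn, so the numbers are only illustrative) showing the training fit degrade as the penalty grows:

```python
# Sketch: training-set fit gets worse as the ridge penalty grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)
for lam in [0.001, 0.1, 10, 1000, 100000]:
    fit = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>8}: training R^2 = {fit.score(X, y):.4f}")
```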

It is worth noting that Ridge regression will never "eliminate" parameters as your question suggests. If you want to do model selection, I would suggest looking into Lasso instead, for a start.

wiwh

Much like in the story of the three bears, one doesn't want to choose $\lambda$ too big or too small; one wants to choose it just right. That is to say, the commonly accepted practice is to choose $\lambda$ based on an empirical procedure such as cross-validation (CV). To arbitrarily pick the regularization coefficient is, imho, pointless. So I suspect I would disagree with the underlying idea of "don't choose $\lambda$ too big", because one shouldn't hand-pick $\lambda$ at all. The only thing I can point out, in trying to understand what constant to choose, is that if $\lambda$ is sufficiently large, all the coefficients are driven to zero and the estimate becomes just the mean value of the dependent variable.
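As a concrete illustration of choosing $\lambda$ empirically (a minimal sketch, not from the answer; it uses scikit-learn's RidgeCV on made-up data, and the grid of candidate values is arbitrary):

```python
# Sketch: pick the ridge penalty by cross-validation instead of by hand.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
alphas = np.logspace(-3, 3, 25)               # candidate lambda values on a log grid
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("lambda chosen by 5-fold CV:", model.alpha_)
```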

meh

Underfitting means that you are not capturing the trend well enough (fitting the trend too little <-> underfitting).

This typically happens because the model is too simple (conversely, an overly complex model will too easily fit random error/variance, i.e. overfit), e.g. when the degree of a polynomial is reduced or when the number of regressors is reduced.

But underfitting can also happen due to other types of bias. In ridge regression the penalized cost function reduces the absolute values of the coefficients, which

  • introduces/increases a bias: not by simplifying the model or reducing the number of coefficients, but because the coefficients associated with the trend are shrunken;
  • but also reduces the variance (the variance of the sampling distribution, i.e. of the test/experiment outcome, which won't be the same each time), because the coefficients associated with noise/random error tend to decrease faster (initially).

The ridge penalty is about this balance between bias and variance. The introduced bias can actually decrease the expected value of the total error (bias$^2$ + variance), as long as the penalty is increased only up to a certain level (beyond that level the added bias becomes too big for the reduced variance to make up for it).
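For concreteness (this is the standard bias-variance decomposition, not something spelled out in the answer): if the data follow $y = f(x) + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \sigma^2$, the expected squared prediction error at a point $x$ splits as

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \sigma^2 .
$$

Increasing $\lambda$ raises the bias term while lowering the variance term; the useful range of $\lambda$ is where the sum still decreases.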

Sextus Empiricus