
I have seen many similar questions, and I understand that $\lambda$ is some kind of tuning parameter that decides how much we want to penalize the flexibility of our model. In other words, $\lambda$ helps us decide how badly we want a perfect fit and how much bias we are willing to accept in exchange for a nice-looking function, right?

But I'd like to understand the behavior of the model as we increase the tuning parameter. For $\lambda = 0$ all we care about is the fit, and we get the least-squares fit. As $\lambda$ increases, the model becomes less and less "spiky": it no longer shoots up to high values only to come back down again soon after. It becomes smoother and smoother.

And now, finally, when $\lambda$ gets arbitrarily large, $\lambda \rightarrow +\infty$, the penalty is very large and the coefficients approach zero. Does that mean (from a graphical point of view) that as $\lambda$ grows the fit becomes smoother and smoother until it is "almost" flat, and finally the horizontal line $y=0$? Or am I missing something?
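
To check my intuition, here is a minimal numerical sketch (assuming NumPy; the noisy-sine data and the degree-6 polynomial basis are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Degree-6 polynomial design matrix; the constant column is penalized
# along with the rest, which is what lets the fit flatten all the way to y = 0.
X = np.vander(x, N=7, increasing=True)

for lam in [0.0, 1e-2, 1e2, 1e8]:
    # Ridge solution: beta = (X'X + lambda*I)^(-1) X'y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    yhat = X @ beta
    print(f"lambda={lam:g}: max|beta|={np.abs(beta).max():.4f}, "
          f"max|fit|={np.abs(yhat).max():.4f}")
```

The printed magnitudes shrink towards zero as $\lambda$ increases, which is the flattening I describe above.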

1 Answer

Your understanding is right.

We want to solve the problem of

$$\min_\beta \|y-X\beta\|^2 + \lambda \|\beta\|^2$$

As we can see, when $\lambda=0$, the objective function reduces to ordinary least squares.

When we increase $\lambda$, more and more weight is placed on the term that penalizes a large $\|\beta\|$, and hence we may sacrifice some of the fit.

As $\lambda \to \infty$, the priority is to drive $\beta$ to zero to avoid a large objective value, and hence $\beta \to 0$.
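
Concretely, setting the gradient of the objective to zero gives the standard closed-form ridge solution

$$\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y,$$

so $\lambda = 0$ yields exactly the OLS estimator $(X^\top X)^{-1} X^\top y$, while for large $\lambda$ the estimate behaves like $\lambda^{-1} X^\top y$, which goes to $0$.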

  • Just one follow-up question about the cost function. First we choose $\lambda$ and then try to solve the minimization problem? For a moment I was thinking about choosing $\lambda$ in a way that minimizes the cost function, but, ceteris paribus, that would mean $\lambda = 0$? And that's just not the point here? – bajun65537 Sep 24 '20 at 09:15
  • For each data set, we have to try multiple values of $\lambda$. We choose the $\lambda$ that optimizes $AIC$ or $BIC$, or we perform cross-validation to measure generalization; see the sketch after these comments. – Siong Thye Goh Sep 24 '20 at 09:19
  • @bajun65537 $\lambda$ is a *hyperparameter* that we tune to improve our out-of-sample fit at the expense of some in-sample fit, not a regular parameter we optimize for maximal in-sample performance. – Dave Sep 24 '20 at 10:59