
I have seen many similar questions, and I understand that $\lambda$ is some kind of tuning parameter that decides how much we want to penalize the flexibility of our model. In other words, $\lambda$ helps us decide how badly we want a perfect fit and how much bias we are willing to accept in exchange for a nice-looking function, right?

But I'd like to understand the behavior of the model as we increase the tuning parameter. For $\lambda = 0$ all we care about is the fit, and we get the least-squares fit. As $\lambda$ increases, the model becomes less and less "spiky": it no longer shoots up to high values only to come back down again soon after. It becomes smoother and smoother.

And now, finally, when $\lambda$ gets arbitrarily large, $\lambda \rightarrow +\infty$, the penalty is very large and the coefficients approach zero. Does that mean (from a graphical point of view) that as $\lambda$ grows the fit becomes smoother and smoother until it is "almost" flat, and finally the horizontal line $y=0$? Or am I missing something?
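
To check my intuition, here is a minimal numerical sketch (assuming NumPy; the noisy-sine data and the degree-6 polynomial basis are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Degree-6 polynomial design matrix; the constant column is penalized
# along with the rest, which is what lets the fit flatten all the way to y = 0.
X = np.vander(x, N=7, increasing=True)

for lam in [0.0, 1e-2, 1e2, 1e8]:
    # Ridge solution: beta = (X'X + lambda*I)^(-1) X'y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    yhat = X @ beta
    print(f"lambda={lam:g}: max|beta|={np.abs(beta).max():.4f}, "
          f"max|fit|={np.abs(yhat).max():.4f}")
```

The printed magnitudes shrink towards zero as $\lambda$ increases, which is the flattening I describe above.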

1 Answer

Your understanding is right.

We want to solve the problem of

$$\min_\beta \|y-X\beta\|^2 + \lambda \|\beta\|^2$$

As we can see, when $\lambda=0$, the objective function reduces to ordinary least squares.

When we increase $\lambda$, more and more weight is placed on the term that penalizes a large $\|\beta\|$, and hence we may sacrifice some of the fit.

As $\lambda \to \infty$, the priority is to drive $\beta$ to zero to avoid a large objective value, and hence $\beta \to 0$.
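
Concretely, setting the gradient of the objective to zero gives the standard closed-form ridge solution

$$\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y,$$

so $\lambda = 0$ yields exactly the OLS estimator $(X^\top X)^{-1} X^\top y$, while for large $\lambda$ the estimate behaves like $\lambda^{-1} X^\top y$, which goes to $0$.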

  • Just one follow-up question about the cost function. First we choose $\lambda$ and then try to solve the minimization problem? For a moment I was thinking about choosing $\lambda$ in a way that minimizes the cost function, but, ceteris paribus, that would mean $\lambda = 0$? And that's just not the point here? – bajun65537 Sep 24 '20 at 09:15
  • For each data set, we have to try multiple values of $\lambda$. We choose the $\lambda$ that optimizes $AIC$ or $BIC$, or we perform cross-validation to measure generalization; see the sketch after these comments. – Siong Thye Goh Sep 24 '20 at 09:19
  • @bajun65537 $\lambda$ is a *hyperparameter* that we tune to improve our out-of-sample fit at the expense of some in-sample fit, not a regular parameter we optimize for maximal in-sample performance. – Dave Sep 24 '20 at 10:59