
Very basic question here, but I would like to understand (not mathematically) how adding a "penalty" (the sum of squared coefficients times a scalar) to the residual sum of squares can shrink large coefficients? Thanks!

TmSmth
For a graphical / visual intuition, have a look at these: https://stats.stackexchange.com/questions/350046/the-graphical-intuiton-of-the-lasso-in-case-p-2/351883#351883 , https://stats.stackexchange.com/questions/351631/graphical-path-coordinate-descent-in-case-of-semi-differentiable-functions-such/351689#351689 – Xavier Bourret Sicotte Jul 23 '18 at 09:55

1 Answer


Because your "penalty" representation of the minimization problem is just the Lagrangian form of a constrained optimization problem:

Assume centered variables. In both cases, lasso and ridge, your unconstrained target function is then the usual sum of squared residuals; i.e., given $p$ regressors you minimize $$RSS(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i-(x_{i,1}\beta_1 +\dots +x_{i,p}\beta_p))^2$$ over all $\boldsymbol{\beta} =(\beta_1,\dots, \beta_p)$.

Now, in the case of ridge regression you minimize $RSS(\boldsymbol{\beta})$ subject to $$\sum_{j=1}^p\beta_j^2 \leq t_{ridge},$$ for some value of $t_{ridge}\geq 0$. For small values of $t_{ridge}$ it is impossible to recover the standard least squares solution, in which you minimize $RSS(\boldsymbol{\beta})$ without any constraint -- think of $t_{ridge}=0$: the only feasible solution is $\beta_1 = \dots = \beta_p = 0$.
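
To see this numerically, here is a minimal sketch (the data, the sample size and the grid of penalty values are all made up for illustration). It solves the penalized ridge problem $RSS(\boldsymbol{\beta}) + \lambda \sum_{j=1}^p \beta_j^2$ in closed form for increasing $\lambda$; the sum of squared coefficients, which is exactly the quantity bounded by $t_{ridge}$, shrinks toward zero as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up centered data: 100 observations, 3 regressors (purely illustrative)
n, p = 100, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([5.0, -3.0, 2.0]) + rng.normal(size=n)
y -= y.mean()

# Penalized ridge solution in closed form: beta = (X'X + lambda * I)^{-1} X'y
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:7.1f}  beta = {np.round(beta_ridge, 3)}  "
          f"sum of squared coefficients = {np.sum(beta_ridge**2):.3f}")
```

With $\lambda = 0$ you recover the ordinary least squares fit; as $\lambda$ grows, the implied bound $t_{ridge} = \sum_j \hat\beta_j^2$ along this path gets smaller and smaller.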

On the other hand, in the case of the lasso, you minimize $RSS(\boldsymbol{\beta})$ under the constraint $$\sum_{j=1}^p|\beta_j| \leq t_{lasso},$$ for some value of $t_{lasso}\geq 0$.

Both constrained optimization problems can be equivalently formulated as unconstrained optimization problems; e.g., for the lasso you can equivalently minimize

$$\sum_{i=1}^n (y_i-(x_{i,1}\beta_1 +\dots +x_{i,p}\beta_p))^2 + \lambda_{lasso}\sum_{j=1}^p|\beta_j|.$$

Larger values of $\lambda_{lasso}$ correspond to smaller values of $t_{lasso}$: the heavier the penalty, the tighter the implicit constraint, and hence the more the coefficients are shrunk toward zero. The same reasoning applies to ridge with the sum of squared coefficients.
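
As a companion sketch (again with made-up data), the same effect for the lasso can be illustrated with scikit-learn's `Lasso`; note that scikit-learn minimizes $\frac{1}{2n} RSS + \alpha \sum_j |\beta_j|$, so its `alpha` plays the role of $\lambda_{lasso}$ only up to the $1/(2n)$ factor. Unlike ridge, the $\ell_1$ penalty does not just shrink the coefficients but sets some of them exactly to zero as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Made-up centered data: 100 observations, 3 regressors (purely illustrative)
n, p = 100, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([5.0, -3.0, 2.0]) + rng.normal(size=n)
y -= y.mean()

# scikit-learn's Lasso minimizes (1/(2n)) * RSS + alpha * sum(|beta_j|),
# so alpha corresponds to lambda_lasso up to the 1/(2n) scaling.
for alpha in [0.1, 1.0, 3.0, 6.0]:
    model = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    print(f"alpha = {alpha:4.1f}  beta = {np.round(model.coef_, 3)}  "
          f"sum of |beta_j| = {np.abs(model.coef_).sum():.3f}")
```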

BloXX