
In Introduction to Statistical Learning, in the section where ridge regression is explained, the authors say:

As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

Here is my take on proving this line:
In ridge regression we have to minimize the sum $$RSS+\lambda\sum_{j=1}^p\beta_j^2=\sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big)^2+\lambda\sum_{j=1}^p\beta_j^2.$$
Here, we can see that a general increase in the $\beta$ vector will decrease $RSS$ and increase the other term. So, in order to minimize the whole expression, a kind of equilibrium must be struck between the $RSS$ term and the penalty term $\lambda\sum_{j=1}^p\beta_j^2$. Let their sum be $S$.
Now, if we increase $\lambda$ (say, by $1$), then at the previous value of the $\beta$ vector the penalty $\lambda\sum_{j=1}^p\beta_j^2$ increases while $RSS$ stays the same, so $S$ increases. To attain a new equilibrium, we can see that decreasing the coefficients $\beta_j$ will restore the balance.$^{[1]}$

Therefore, as a general trend, we can say that increasing the value of $\lambda$ decreases the magnitude of the coefficients.
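As a quick sanity check of this trend (a one-predictor illustration added here, not taken from ISL): with a single centered predictor and no intercept, setting the derivative of $\sum_i(y_i-\beta_1 x_i)^2+\lambda\beta_1^2$ with respect to $\beta_1$ to zero gives $$\hat\beta_1(\lambda)=\frac{\sum_i x_iy_i}{\sum_i x_i^2+\lambda},$$ whose magnitude strictly decreases as $\lambda$ increases (whenever $\sum_i x_iy_i\neq 0$).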

Now, if the coefficients of the predictors decrease in magnitude, then the predictors' contribution to the model decreases. That is, their effect decreases, and thus the flexibility of the model should decrease.


This argument seems appealing, but I have a gut feeling that there are gaps here and there. If it is correct, good. If it isn't, I would like to know where it fails and, of course, to see a correct version of it.


$^{[1]}$: I can attach a plausible explanation of this point, if needed.

Mooncrater

3 Answers


This can be most easily seen through Lagrange duality: there exists some $C$ so that $$\arg\min_{\beta \in \mathbb{R}^p} \left\{RSS + \lambda \sum_{i=1}^p \beta_i^2\right\} = \arg\min_{\beta\in\mathbb{R}^p \, : \, \|\beta\|_2^2 \leq C} RSS.$$ Further, we know that a larger $\lambda$ corresponds to a smaller $C$. Therefore, increasing the tuning parameter $\lambda$ further constrains the $\ell_2$ norm of the coefficients, leading to less flexibility.
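As a quick numerical illustration of this tightening constraint (a sketch added here, using the ridge closed form with the intercept omitted for simplicity), the $\ell_2$ norm of the fitted coefficients shrinks as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge closed form (no intercept): beta_hat = (X'X + lambda*I)^{-1} X'y
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:7.1f}   ||beta||_2 = {np.linalg.norm(beta):.4f}")
```

The printed norms decrease monotonically in $\lambda$, which is exactly the statement that the implicit constraint radius $C$ shrinks.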

user795305

Let's ignore the penalty term for a moment, while we explore the sensitivity of the solution to changes in a single observation. This has ramifications for all linear least-squares models, not just Ridge regression.

Notation

To simplify the notation, let $X$ be the model matrix, including a column of constant values (and therefore having $p+1$ columns indexed from $0$ through $p$), let $y$ be the response $n$-vector, and let $\beta=(\beta_0, \beta_1, \ldots, \beta_p)$ be the $p+1$-vector of coefficients. Write $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ for observation $i$. The unpenalized objective is the (squared) $L_2$ norm of the difference,

$$RSS(\beta)=||y - X\beta||^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2.\tag{1}$$

Without any loss of generality, order the observations so the one in question is the last. Let $k$ be the index of any one of the variables ($0 \le k \le p$).

Analysis

The aim is to expose the essential simplicity of this situation by focusing on how the sum of squares $RSS$ depends on $x_{nk}$ and $\beta_k$; nothing else matters. To this end, split $RSS$ into the contributions from the first $n-1$ observations and the last one:

$$RSS(\beta) = (y_n - \mathbf{x}_n\beta)^2 + \sum_{i=1}^{n-1} (y_i - \mathbf{x}_i\beta)^2.$$

Both terms are quadratic functions of $\beta_k$. Considering all the other $\beta_j,$ $j\ne k$, as constants for the moment, this means the objective can be written in the form

$$RSS(\beta_k) = (x_{nk}^2 \beta_k^2 + E\beta_kx_{nk} + F) + (A^2\beta_k^2 + B\beta_k + C).$$

The new quantities $A\cdots F$ do not depend on $\beta_k$ or $x_{nk}$. Combining the terms and completing the square gives something in the form

$$RSS(\beta_k) = \left(\beta_k\sqrt{x_{nk}^2 + A^2} + \frac{Ex_{nk}+B}{2\sqrt{x_{nk}^2+A^2}} \right)^2 + G - \frac{(Ex_{nk}+B)^2}{4(x_{nk}^2+A^2)}\tag{2}$$

where the quantity $G$ does not depend on $x_{nk}$ or $\beta_k$.

Estimating sensitivity

We may readily estimate the sizes of the coefficients in $(2)$ when $|x_{nk}|$ grows large compared to $|A|$. When that is the case,

$$RSS(\beta_k) \approx \left(\beta_k x_{nk} + E/2\right)^2 + G-E^2/4.$$

This makes it easy to see what changing $|x_{nk}|$ must do to the optimum $\hat\beta_k$. For sufficiently large $|x_{nk}|$, $\hat\beta_k$ will be approximately inversely proportional to $x_{nk}$.
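To spell out that step (it follows directly from $(2)$, since the minimum over $\beta_k$, holding the other coefficients fixed, is attained where the squared term vanishes): $$\hat\beta_k = -\frac{Ex_{nk}+B}{2\left(x_{nk}^2+A^2\right)},$$ which for large $|x_{nk}|$ behaves like $-E/(2x_{nk})$ when $E\neq 0$, and like $-B/(2x_{nk}^2)$ when $E=0$.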

We actually have learned, and proven, much more than was requested, because Ridge regression can be formulated as model $(1)$. Specifically, to the original $n$ observations you adjoin $p+1$ fake observations of the form $\mathbf{x}_{n+j} = (0,0,\ldots, 0,1,0,\ldots,0)$ (with the $1$ in position $j$) together with responses $y_{n+j}=0$, and you multiply these fake rows by $\sqrt\lambda$, so that their squared residuals contribute exactly the penalty $\lambda\sum_j\beta_j^2$. Because the fake responses and the remaining entries of these rows are zero, the cross term $E$ in $(2)$ vanishes for them, and the preceding analysis shows that for $\lambda$ sufficiently large (and "sufficiently" can be computed in terms of $|A|$, which is a function of the actual data only), every one of the $\hat\beta_k$ will be approximately inversely proportional to $(\sqrt\lambda)^2=\lambda$.
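To check the fake-observation construction numerically, here is a small sketch added here (intercept omitted for simplicity, so only $p$ fake rows are appended); it compares the ridge closed form with OLS on the augmented data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
lam = 10.0

# Ridge closed form: (X'X + lambda*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS on augmented data: append sqrt(lambda)*I as fake predictor rows, zeros as responses
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))  # True: the two formulations agree
```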


An analysis that requires some more sophisticated results from Linear Algebra appears at The proof of shrinking coefficients using ridge regression through "spectral decomposition". It does add one insight: the coefficients in the asymptotic relationships $\hat\beta_k \sim 1/\lambda$ will be the reciprocal nonzero singular values of $X$.
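For reference (a standard formula added here, assuming centered predictors and no intercept for simplicity), in terms of the singular value decomposition $X=UDV^\top$ with singular values $d_j$, the ridge solution is $$\hat\beta_\lambda = V\,\operatorname{diag}\!\left(\frac{d_j}{d_j^2+\lambda}\right)U^\top y,$$ so each component indeed scales like $1/\lambda$ once $\lambda$ dominates every $d_j^2$.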

whuber
  • But the penalty parameter $\lambda$ is multiplied by $\sum_{i=0}^p\beta_i^2$, so why are we multiplying it with the observations $x_{n+i}$? – Mooncrater Aug 12 '17 at 11:46
  • Mooncrater, please look at the thread I referenced at https://stats.stackexchange.com/a/164546/919. It shows how Ridge regression can be implemented as OLS regression by adding fake observations and multiplying them all simultaneously by $\sqrt\lambda$. Now apply the analysis here, one fake observation at a time, to see how that makes each $\hat\beta_k$ asymptotically inversely proportional to $\lambda$. – whuber Aug 12 '17 at 14:54

Here, we can see that a general increase in the β vector will decrease RSS and increase the other term.

That is not strictly true. For example, check what happens to your $RSS$ if $p=1$ and $y_i=0$ for all $n$ points as you increase $\beta$.
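Spelling that counterexample out: with $p=1$ and all responses equal to zero, $$RSS(\beta)=\sum_{i=1}^n\bigl(0-\beta_0-\beta_1x_{i1}\bigr)^2=\sum_{i=1}^n(\beta_0+\beta_1x_{i1})^2,$$ which grows rather than shrinks as the coefficients move away from zero, so increasing the $\beta$ vector does not always decrease $RSS$.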
rinspy