
In Introduction to Statistical Learning, in the section where ridge regression is explained, the authors say:

As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

Here is my take on proving this line:
In ridge regression we have to minimize the sum $$RSS+\lambda\sum_{j=1}^p\beta_j^2=\sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Big)^2+\lambda\sum_{j=1}^p\beta_j^2.$$
Here, we can see that a general increase in the $\beta$ vector will decrease $RSS$ and increase the other term. So, in order to minimize the whole expression, a kind of equilibrium must be struck between the $RSS$ term and the penalty term $\lambda\sum_{j=1}^p\beta_j^2$. Let their sum be $S$.
Now, if we increase $\lambda$ (say, by $1$), then at the previous value of the $\beta$ vector the penalty $\lambda\sum_{j=1}^p\beta_j^2$ increases while $RSS$ stays the same, so $S$ increases. To attain a new equilibrium, we can see that decreasing the coefficients $\beta_j$ will restore the balance.$^{[1]}$

Therefore, as a general trend, we can say that increasing the value of $\lambda$ decreases the magnitude of the coefficients.
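As a quick sanity check of this trend (a one-predictor illustration added here, not taken from ISL): with a single centered predictor and no intercept, setting the derivative of $\sum_i(y_i-\beta_1 x_i)^2+\lambda\beta_1^2$ with respect to $\beta_1$ to zero gives $$\hat\beta_1(\lambda)=\frac{\sum_i x_iy_i}{\sum_i x_i^2+\lambda},$$ whose magnitude strictly decreases as $\lambda$ increases (whenever $\sum_i x_iy_i\neq 0$).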

Now, if the coefficients of the predictors decrease in magnitude, then the predictors' contribution to the model decreases. That is, their effect decreases, and thus the flexibility of the model should decrease.


This argument seems appealing, but I have a gut feeling that there are gaps here and there. If it is correct, good. If it isn't, I would like to know where it fails and, of course, to see a correct version of it.


$^{[1]}$: I can attach a plausible explanation of this point, if needed.

Mooncrater

3 Answers


This can be most easily seen through Lagrange duality: there exists some $C$ so that $$\arg\min_{\beta \in \mathbb{R}^p} \left\{RSS + \lambda \sum_{i=1}^p \beta_i^2\right\} = \arg\min_{\beta\in\mathbb{R}^p \, : \, \|\beta\|_2^2 \leq C} RSS.$$ Further, we know that a larger $\lambda$ corresponds to a smaller $C$. Therefore, increasing the tuning parameter $\lambda$ further constrains the $\ell_2$ norm of the coefficients, leading to less flexibility.
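As a quick numerical illustration of this tightening constraint (a sketch added here, using the ridge closed form with the intercept omitted for simplicity), the $\ell_2$ norm of the fitted coefficients shrinks as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge closed form (no intercept): beta_hat = (X'X + lambda*I)^{-1} X'y
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:7.1f}   ||beta||_2 = {np.linalg.norm(beta):.4f}")
```

The printed norms decrease monotonically in $\lambda$, which is exactly the statement that the implicit constraint radius $C$ shrinks.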

user795305

Let's ignore the penalty term for a moment, while we explore the sensitivity of the solution to changes in a single observation. This has ramifications for all linear least-squares models, not just Ridge regression.

Notation

To simplify the notation, let $X$ be the model matrix, including a column of constant values (and therefore having $p+1$ columns indexed from $0$ through $p$), let $y$ be the response $n$-vector, and let $\beta=(\beta_0, \beta_1, \ldots, \beta_p)$ be the $p+1$-vector of coefficients. Write $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ for observation $i$. The unpenalized objective is the (squared) $L_2$ norm of the difference,

$$RSS(\beta)=||y - X\beta||^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i\beta)^2.\tag{1}$$

Without any loss of generality, order the observations so the one in question is the last. Let $k$ be the index of any one of the variables ($0 \le k \le p$).

Analysis

The aim is to expose the essential simplicity of this situation by focusing on how the sum of squares $RSS$ depends on $x_{nk}$ and $\beta_k$; nothing else matters. To this end, split $RSS$ into the contributions from the first $n-1$ observations and the last one:

$$RSS(\beta) = (y_n - \mathbf{x}_n\beta)^2 + \sum_{i=1}^{n-1} (y_i - \mathbf{x}_i\beta)^2.$$

Both terms are quadratic functions of $\beta_k$. Considering all the other $\beta_j,$ $j\ne k$, as constants for the moment, this means the objective can be written in the form

$$RSS(\beta_k) = (x_{nk}^2 \beta_k^2 + E\beta_kx_{nk} + F) + (A^2\beta_k^2 + B\beta_k + C).$$

The new quantities $A\cdots F$ do not depend on $\beta_k$ or $x_{nk}$. Combining the terms and completing the square gives something in the form

$$RSS(\beta_k) = \left(\beta_k\sqrt{x_{nk}^2 + A^2} + \frac{Ex_{nk}+B}{2\sqrt{x_{nk}^2+A^2}} \right)^2 + G - \frac{(Ex_{nk}+B)^2}{4(x_{nk}^2+A^2)}\tag{2}$$

where the quantity $G$ does not depend on $x_{nk}$ or $\beta_k$.

Estimating sensitivity

We may readily estimate the sizes of the coefficients in $(2)$ when $|x_{nk}|$ grows large compared to $|A|$. When that is the case,

$$RSS(\beta_k) \approx \left(\beta_k x_{nk} + E/2\right)^2 + G-E^2/4.$$

This makes it easy to see what changing $|x_{nk}|$ must do to the optimum $\hat\beta_k$. For sufficiently large $|x_{nk}|$, $\hat\beta_k$ will be approximately inversely proportional to $x_{nk}$.
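To spell out that step (it follows directly from $(2)$, since the minimum over $\beta_k$, holding the other coefficients fixed, is attained where the squared term vanishes): $$\hat\beta_k = -\frac{Ex_{nk}+B}{2\left(x_{nk}^2+A^2\right)},$$ which for large $|x_{nk}|$ behaves like $-E/(2x_{nk})$ when $E\neq 0$, and like $-B/(2x_{nk}^2)$ when $E=0$.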

We actually have learned, and proven, much more than was requested, because Ridge regression can be formulated as model $(1)$. Specifically, to the original $n$ observations you adjoin $p+1$ fake observations of the form $\mathbf{x}_{n+j} = (0,0,\ldots, 0,1,0,\ldots,0)$ (with the $1$ in position $j$) together with responses $y_{n+j}=0$, and you multiply these fake rows by $\sqrt\lambda$, so that their squared residuals contribute exactly the penalty $\lambda\sum_j\beta_j^2$. Because the fake responses and the remaining entries of these rows are zero, the cross term $E$ in $(2)$ vanishes for them, and the preceding analysis shows that for $\lambda$ sufficiently large (and "sufficiently" can be computed in terms of $|A|$, which is a function of the actual data only), every one of the $\hat\beta_k$ will be approximately inversely proportional to $(\sqrt\lambda)^2=\lambda$.
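To check the fake-observation construction numerically, here is a small sketch added here (intercept omitted for simplicity, so only $p$ fake rows are appended); it compares the ridge closed form with OLS on the augmented data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
lam = 10.0

# Ridge closed form: (X'X + lambda*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS on augmented data: append sqrt(lambda)*I as fake predictor rows, zeros as responses
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))  # True: the two formulations agree
```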


An analysis that requires some more sophisticated results from Linear Algebra appears at The proof of shrinking coefficients using ridge regression through "spectral decomposition". It does add one insight: the coefficients in the asymptotic relationships $\hat\beta_k \sim 1/\lambda$ will be the reciprocal nonzero singular values of $X$.
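For reference (a standard formula added here, assuming centered predictors and no intercept for simplicity), in terms of the singular value decomposition $X=UDV^\top$ with singular values $d_j$, the ridge solution is $$\hat\beta_\lambda = V\,\operatorname{diag}\!\left(\frac{d_j}{d_j^2+\lambda}\right)U^\top y,$$ so each component indeed scales like $1/\lambda$ once $\lambda$ dominates every $d_j^2$.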

whuber
  • But the penalty parameter $\lambda$ is multiplied by $\sum_{i=0}^p\beta_i^2$, so why are we multiplying it with the observations $x_{n+i}$? – Mooncrater Aug 12 '17 at 11:46
  • Mooncrater, please look at the thread I referenced at https://stats.stackexchange.com/a/164546/919. It shows how Ridge regression can be implemented as OLS regression by adding fake observations and multiplying them all simultaneously by $\sqrt\lambda$. Now apply the analysis here, one fake observation at a time, to see how that makes each $\hat\beta_k$ asymptotically inversely proportional to $\lambda$. – whuber Aug 12 '17 at 14:54

Here, we can see that a general increase in the β vector will decrease RSS and increase the other term.

That is not strictly true. For example, check what happens to your $RSS$ if $p=1$ and $y_i=0$ for all $n$ points as you increase $\beta$.
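Spelling that counterexample out: with $p=1$ and all responses equal to zero, $$RSS(\beta)=\sum_{i=1}^n\bigl(0-\beta_0-\beta_1x_{i1}\bigr)^2=\sum_{i=1}^n(\beta_0+\beta_1x_{i1})^2,$$ which grows rather than shrinks as the coefficients move away from zero, so increasing the $\beta$ vector does not always decrease $RSS$.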
rinspy