Numerical stability and overfitting are in some sense related but different issues.
The classic OLS problem:
Consider the classic least squares problem:
$$\operatorname*{minimize}(\text{over $\mathbf{b}$}) \quad(\mathbf{y}-X\mathbf{b})'(\mathbf{y}-X\mathbf{b}) $$
The solution is the classic $\hat{\mathbf{b}} = (X'X)^{-1}X'\mathbf{y}$. The idea is that, by the law of large numbers:
$$ \lim_{n \rightarrow \infty} \frac{1}{n} X'X = \mathrm{E}[\mathbf{x}\mathbf{x}'] \quad \quad \quad \lim_{n \rightarrow \infty} \frac{1}{n} X'\mathbf{y} = \mathrm{E}[\mathbf{x}y]$$
Hence the OLS estimate $\hat{\mathbf{b}}$ also converges to $\mathrm{E}[\mathbf{x}\mathbf{x}']^{-1}\mathrm{E}[\mathbf{x}y]$. (In linear algebra terms, this is the linear projection of random variable $y$ onto the linear span of random variables $x_1, x_2, \ldots, x_k$.)
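As a minimal numpy sketch (with made-up data and coefficients), the OLS estimate computed from the normal equations gets closer to the true population coefficients as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
b_true = np.array([2.0, -1.0])       # hypothetical population coefficients

for n in [100, 100_000]:
    X = rng.normal(size=(n, 2))          # regressors
    y = X @ b_true + rng.normal(size=n)  # responses with unit-variance noise
    # OLS via the normal equations: b_hat = (X'X)^{-1} X'y
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, b_hat)
```

With $n = 100{,}000$ the estimate lands within a few thousandths of `b_true`; with $n = 100$ the sampling error is visibly larger.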
Problems?
Mechanically, what can go wrong?
1. For small samples, our sample estimates of $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$ may be poor.
2. If columns of $X$ are collinear (either due to inherent collinearity or small sample size), the problem will have a continuum of solutions! The solution may not be unique.
   - This occurs if $\mathrm{E}[\mathbf{x}\mathbf{x}']$ is rank deficient.
   - This also occurs if $X'X$ is rank deficient due to a small sample size relative to the number of regressors.
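A quick numpy illustration of the rank-deficient case (the duplicated regressor is an invented example): when one column of $X$ is an exact multiple of another, $X'X$ cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, 2 * x])          # second regressor is an exact multiple of the first
XtX = X.T @ X

# X'X is rank deficient, so (X'X)^{-1} does not exist
print(np.linalg.matrix_rank(XtX))  # 1, not 2
```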
Problem (1) can lead to overfitting: the estimate $\hat{\mathbf{b}}$ reflects patterns in $\frac{1}{n}X'X$ and $\frac{1}{n}X'\mathbf{y}$ that don't actually exist in $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$, i.e. patterns in the sample that aren't there in the underlying population.
Problem (2) means a solution isn't unique. Imagine we're trying to estimate the prices of individual shoes, but pairs of shoes are always sold together. This is an ill-posed problem, but let's say we're doing it anyway. We may believe the left shoe price plus the right shoe price equals \$50, but how can we come up with individual prices? Is setting the left shoe price $p_l = 45$ and the right shoe price $p_r = 5$ OK? How can we choose from all the possibilities?
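Sketching the shoe example in numpy (data invented for illustration): every split of the \$50 fits the sales data equally well, so least squares alone cannot choose between them.

```python
import numpy as np

# Hypothetical sales data: each row is one pair sold (one left, one right) for $50
X = np.ones((4, 2))
y = np.full(4, 50.0)

# Any split of $50 between left and right shoe fits the data perfectly
for p_l, p_r in [(45.0, 5.0), (25.0, 25.0), (50.0, 0.0)]:
    residuals = y - X @ np.array([p_l, p_r])
    print((p_l, p_r), residuals @ residuals)  # sum of squared residuals: 0.0 each time
```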
Introducing an $L_2$ penalty:
Now consider:
$$\operatorname*{minimize}(\text{over }\mathbf{b})\quad (\mathbf{y}-X\mathbf{b})'(\mathbf{y}-X\mathbf{b}) + \lambda\|\mathbf{b}\|^2 $$
This may help with both types of problems. The $L_2$ penalty pushes our estimate of $\mathbf{b}$ towards zero. It functions effectively as a Bayesian prior that the distribution of coefficient values is centered around $\mathbf{0}$. That helps with overfitting: our estimate will reflect both the data and our initial belief that $\mathbf{b}$ is near zero.
$L_2$ regularization also allows us to find a unique solution to ill-posed problems. If we know the left and right shoe prices total $\$50$, the solution that also minimizes the $L_2$ norm is $p_l = p_r = 25$.
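Continuing the shoe example (same invented data), the closed-form ridge solution $(X'X + \lambda I)^{-1}X'\mathbf{y}$ picks the equal split:

```python
import numpy as np

# Hypothetical sales data: each pair (one left, one right shoe) sells for $50
X = np.ones((4, 2))
y = np.full(4, 50.0)

lam = 1e-6  # a tiny ridge penalty, just enough to pin down a unique solution
# Ridge solution: (X'X + lam I)^{-1} X'y
p = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(p)  # approximately [25., 25.]
```

Unlike plain OLS, the penalized problem is strictly convex, so the solve succeeds and returns the minimum-norm split.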
Is this magic? No. Regularization isn't the same as adding data that would actually allow us to answer the question. $L_2$ regularization in some sense adopts the view that, if you lack data, you should choose estimates closer to $\mathbf{0}$.