Numerical stability and overfitting are in some sense related but different issues.
The classic OLS problem:
Consider the classic least squares problem:
$$\operatorname*{minimize}(\text{over $\mathbf{b}$}) \quad(\mathbf{y}-X\mathbf{b})'(\mathbf{y}-X\mathbf{b}) $$
The solution is the classic $\hat{\mathbf{b}} = (X'X)^{-1}X'\mathbf{y}$. The idea is that, by the law of large numbers:
$$ \lim_{n \rightarrow \infty} \frac{1}{n} X'X = \mathrm{E}[\mathbf{x}\mathbf{x}'] \quad \quad \quad \lim_{n \rightarrow \infty} \frac{1}{n} X'\mathbf{y} = \mathrm{E}[\mathbf{x}y]$$
Hence the OLS estimate $\hat{\mathbf{b}}$ also converges to $\mathrm{E}[\mathbf{x}\mathbf{x}']^{-1}\mathrm{E}[\mathbf{x}y]$. (In linear algebra terms, this is the linear projection of random variable $y$ onto the linear span of random variables $x_1, x_2, \ldots, x_k$.)
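As a minimal numpy sketch (with made-up data and coefficients), the OLS estimate computed from the normal equations gets closer to the true population coefficients as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
b_true = np.array([2.0, -1.0])       # hypothetical population coefficients

for n in [100, 100_000]:
    X = rng.normal(size=(n, 2))          # regressors
    y = X @ b_true + rng.normal(size=n)  # responses with unit-variance noise
    # OLS via the normal equations: b_hat = (X'X)^{-1} X'y
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, b_hat)
```

With $n = 100{,}000$ the estimate lands within a few thousandths of `b_true`; with $n = 100$ the sampling error is visibly larger.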
Problems?
Mechanically, what can go wrong?
1. For small samples, our sample estimates of $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$ may be poor.
2. If columns of $X$ are collinear (either due to inherent collinearity or small sample size), the problem will have a continuum of solutions! The solution may not be unique.
   - This occurs if $\mathrm{E}[\mathbf{x}\mathbf{x}']$ is rank deficient.
   - This also occurs if $X'X$ is rank deficient due to a small sample size relative to the number of regressors.
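A quick numpy illustration of the rank-deficient case (the duplicated regressor is an invented example): when one column of $X$ is an exact multiple of another, $X'X$ cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, 2 * x])          # second regressor is an exact multiple of the first
XtX = X.T @ X

# X'X is rank deficient, so (X'X)^{-1} does not exist
print(np.linalg.matrix_rank(XtX))  # 1, not 2
```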
Problem (1) can lead to overfitting: the estimate $\hat{\mathbf{b}}$ reflects patterns in $\frac{1}{n}X'X$ and $\frac{1}{n}X'\mathbf{y}$ that don't actually exist in $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$, i.e. patterns in the sample that aren't there in the underlying population.
Problem (2) means a solution isn't unique. Imagine we're trying to estimate the prices of individual shoes, but pairs of shoes are always sold together. This is an ill-posed problem, but let's say we're doing it anyway. We may believe the left shoe price plus the right shoe price equals \$50, but how can we come up with individual prices? Is setting the left shoe price $p_l = 45$ and the right shoe price $p_r = 5$ OK? How can we choose from all the possibilities?
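Sketching the shoe example in numpy (data invented for illustration): every split of the \$50 fits the sales data equally well, so least squares alone cannot choose between them.

```python
import numpy as np

# Hypothetical sales data: each row is one pair sold (one left, one right) for $50
X = np.ones((4, 2))
y = np.full(4, 50.0)

# Any split of $50 between left and right shoe fits the data perfectly
for p_l, p_r in [(45.0, 5.0), (25.0, 25.0), (50.0, 0.0)]:
    residuals = y - X @ np.array([p_l, p_r])
    print((p_l, p_r), residuals @ residuals)  # sum of squared residuals: 0.0 each time
```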
Introducing an $L_2$ penalty:
Now consider:
$$\operatorname*{minimize}(\text{over }\mathbf{b})\quad (\mathbf{y}-X\mathbf{b})'(\mathbf{y}-X\mathbf{b}) + \lambda\|\mathbf{b}\|^2 $$
This may help with both types of problems. The $L_2$ penalty pushes our estimate of $\mathbf{b}$ towards zero. It functions effectively as a Bayesian prior that the distribution of coefficient values is centered around $\mathbf{0}$. That helps with overfitting: our estimate will reflect both the data and our initial belief that $\mathbf{b}$ is near zero.
$L_2$ regularization also allows us to find a unique solution to ill-posed problems. If we know the left and right shoe prices total $\$50$, the solution that also minimizes the $L_2$ norm is $p_l = p_r = 25$.
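Continuing the shoe example (same invented data), the closed-form ridge solution $(X'X + \lambda I)^{-1}X'\mathbf{y}$ picks the equal split:

```python
import numpy as np

# Hypothetical sales data: each pair (one left, one right shoe) sells for $50
X = np.ones((4, 2))
y = np.full(4, 50.0)

lam = 1e-6  # a tiny ridge penalty, just enough to pin down a unique solution
# Ridge solution: (X'X + lam I)^{-1} X'y
p = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(p)  # approximately [25., 25.]
```

Unlike plain OLS, the penalized problem is strictly convex, so the solve succeeds and returns the minimum-norm split.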
Is this magic? No. Regularization isn't the same as adding data that would actually allow us to answer the question. $L_2$ regularization in some sense adopts the view that, if you lack data, you should choose estimates closer to $\mathbf{0}$.