I have several questions regarding the ridge penalty in the least squares context:
$$\beta_{ridge} = (\lambda I_D + X'X)^{-1}X'y$$
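For concreteness, here is a minimal numpy sketch of the estimator above; the data, the penalty value `lam`, and all names are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))          # stand-in for standardized inputs
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 1.0                                 # ridge penalty (lambda), arbitrary value
# Closed-form ridge solution: (lambda * I_D + X'X)^{-1} X'y
beta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```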
1) The expression suggests that the covariance matrix of X is shrunk towards a diagonal matrix, meaning that (assuming the variables are standardized before the procedure) the correlation among the input variables is lowered. Is this interpretation correct?
2) If it is a shrinkage application, why is it not formulated along the lines of $(\lambda I_D + (1-\lambda)X'X)$, assuming we can somehow restrict $\lambda$ to the $[0,1]$ range with a normalization?
3) What would be a suitable normalization for $\lambda$ so that it is restricted to a standard range like $[0,1]$?
4) Adding a constant to the diagonal affects all eigenvalues. Would it be better to attack only the singular or near-singular values? Is this equivalent to applying PCA to X and retaining the top N principal components before regression, or does it go by a different name (since it does not modify the cross-covariance calculation)? The first sketch at the end of this post shows what I mean.
5) Can we regularize the cross-covariance as well, and would that be of any use? That is, $$\beta_{ridge} = (\lambda I_D + X'X)^{-1}(\gamma X'y)$$
where a small $\gamma$ shrinks the cross-covariance. Obviously this lowers all the $\beta$s equally, but perhaps there is a smarter way, such as hard or soft thresholding depending on the covariance value (the second sketch at the end shows what I mean).
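Here is a small sketch of the alternative I have in mind in question 4: drop (rather than dampen) the near-singular directions via a truncated SVD, which I believe amounts to a principal-components style regression. The data and the cutoff `tol` are arbitrary, just to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

# Instead of adding lambda to every eigenvalue, discard the near-singular
# directions entirely and invert only the retained part of the spectrum.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = 0.1 * s.max()                        # arbitrary cutoff for "near-singular"
keep = s > tol
beta_trunc = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])
```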
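And a sketch of question 5, again with made-up data and arbitrary values for `lam`, `gamma`, and the threshold `thr`: the first estimator rescales the cross-covariance uniformly, the second soft-thresholds its entries before solving:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
lam, gamma = 1.0, 0.5                      # lambda and gamma, arbitrary values

# Uniform shrinkage of the cross-covariance: gamma just rescales every beta.
xy = X.T @ y
beta_gamma = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, gamma * xy)

# The "smarter" variant I have in mind: soft-threshold the cross-covariance
# entries before solving, so that small covariances are zeroed out.
thr = 0.5 * np.abs(xy).mean()              # arbitrary threshold, for illustration
xy_soft = np.sign(xy) * np.maximum(np.abs(xy) - thr, 0.0)
beta_thresh = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, xy_soft)
```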