I have several questions regarding the ridge penalty in the least squares context:
$$\beta_{ridge} = (\lambda I_D + X'X)^{-1}X'y$$
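For concreteness, here is a minimal numpy sketch of the estimator above; the data, the penalty value `lam`, and all names are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))          # stand-in for standardized inputs
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 1.0                                 # ridge penalty (lambda), arbitrary value
# Closed-form ridge solution: (lambda * I_D + X'X)^{-1} X'y
beta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```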
1) The expression suggests that the covariance matrix of X is shrunk towards a diagonal matrix, meaning that (assuming the variables are standardized before the procedure) the correlation among the input variables is lowered. Is this interpretation correct?
2) If it is a shrinkage application, why is it not formulated along the lines of $(\lambda I_D + (1-\lambda)X'X)$, assuming we can somehow restrict $\lambda$ to the $[0,1]$ range with a normalization?
3) What would be a suitable normalization for $\lambda$ so that it is restricted to a standard range like $[0,1]$?
4) Adding a constant to the diagonal affects all eigenvalues. Would it be better to attack only the singular or near-singular values? Is this equivalent to applying PCA to X and retaining the top N principal components before regression, or does it go by a different name (since it does not modify the cross-covariance calculation)? The first sketch at the end of this post shows what I mean.
5) Can we regularize the cross-covariance as well, and would that be of any use? That is, $$\beta_{ridge} = (\lambda I_D + X'X)^{-1}(\gamma X'y)$$
where a small $\gamma$ shrinks the cross-covariance. Obviously this lowers all the $\beta$s equally, but perhaps there is a smarter way, such as hard or soft thresholding depending on the covariance value (the second sketch at the end shows what I mean).
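Here is a small sketch of the alternative I have in mind in question 4: drop (rather than dampen) the near-singular directions via a truncated SVD, which I believe amounts to a principal-components style regression. The data and the cutoff `tol` are arbitrary, just to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

# Instead of adding lambda to every eigenvalue, discard the near-singular
# directions entirely and invert only the retained part of the spectrum.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = 0.1 * s.max()                        # arbitrary cutoff for "near-singular"
keep = s > tol
beta_trunc = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])
```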
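And a sketch of question 5, again with made-up data and arbitrary values for `lam`, `gamma`, and the threshold `thr`: the first estimator rescales the cross-covariance uniformly, the second soft-thresholds its entries before solving:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
lam, gamma = 1.0, 0.5                      # lambda and gamma, arbitrary values

# Uniform shrinkage of the cross-covariance: gamma just rescales every beta.
xy = X.T @ y
beta_gamma = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, gamma * xy)

# The "smarter" variant I have in mind: soft-threshold the cross-covariance
# entries before solving, so that small covariances are zeroed out.
thr = 0.5 * np.abs(xy).mean()              # arbitrary threshold, for illustration
xy_soft = np.sign(xy) * np.maximum(np.abs(xy) - thr, 0.0)
beta_thresh = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, xy_soft)
```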