This is what I understand so far:
Let $\mathbf X$ be the centered $n \times p$ predictor matrix and consider its singular value decomposition $\mathbf X = \mathbf{USV}^\top$ with $\mathbf S$ being a diagonal matrix with diagonal elements $s_i$.
The fitted values of ordinary least squares (OLS) regression are given by $$\hat {\mathbf y}_\mathrm{OLS} = \mathbf X \beta_\mathrm{OLS} = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y = \mathbf U \mathbf U^\top \mathbf y.$$ The fitted values of ridge regression are given by $$\hat {\mathbf y}_\mathrm{ridge} = \mathbf X\beta_\mathrm{ridge} = \mathbf X (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y = \mathbf U\:\mathrm{diag}\left\{\frac{s_i^2}{s_i^2+\lambda}\right\}\mathbf U^\top\mathbf y.$$ The fitted values of principal component regression (PCR) with $k$ components are given by $$\hat {\mathbf y}_\mathrm{PCR} = \mathbf X_\mathrm{PCA} \beta_\mathrm{PCR} = \mathbf U\: \mathrm{diag}\left\{1,\ldots, 1, 0, \ldots, 0\right\}\mathbf U^\top \mathbf y,$$ where there are $k$ ones followed by $p-k$ zeros.
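These three identities are easy to check numerically. Here is a minimal sketch with NumPy, using randomly generated centered data (the variable names and dimensions are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 50, 5, 2, 10.0        # illustrative sizes and lambda

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                  # center the predictors
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# OLS: X (X'X)^{-1} X' y  ==  U U' y
y_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(y_ols, U @ (U.T @ y))

# Ridge: X (X'X + lam I)^{-1} X' y  ==  U diag(s_i^2/(s_i^2+lam)) U' y
y_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(y_ridge, U @ (s**2 / (s**2 + lam) * (U.T @ y)))

# PCR with k components: U diag(1,...,1,0,...,0) U' y,
# i.e. regressing y on the first k principal components X V_k
d = np.zeros(p); d[:k] = 1.0
y_pcr = U @ (d * (U.T @ y))
Z = X @ Vt.T[:, :k]                  # scores on the first k components
assert np.allclose(y_pcr, Z @ np.linalg.solve(Z.T @ Z, Z.T @ y))

print("all three identities verified")
```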
Then a common interpretation is that the larger the singular value $s_i$, the less the corresponding direction is penalized in ridge regression; the smallest singular values are penalized the most (see the quick numerical check at the end of this question). How exactly is this true? What exactly is being penalized, i.e. what is shrinking or growing? The principal components are in a different space than the columns of $\mathbf U$. I certainly see how the singular values relate to the principal components; I'm just not sure how the math relates the singular values to the ridge coefficients.
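For concreteness, here is the arithmetic behind the claim, just evaluating the factors $\frac{s_i^2}{s_i^2+\lambda}$ from the ridge formula above for some made-up singular values and $\lambda$:

```python
import numpy as np

# Ridge shrinkage factors s_i^2 / (s_i^2 + lambda) for illustrative values:
# directions with large s_i are barely shrunk, small ones almost to zero.
lam = 10.0
s = np.array([20.0, 10.0, 5.0, 1.0, 0.1])
print(np.round(s**2 / (s**2 + lam), 4))
# -> [0.9756 0.9091 0.7143 0.0909 0.001 ]
```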