
In the question How to derive the ridge regression solution? there is an answer by whuber which describes how the columns of the augmented design matrix approach pairwise orthogonality as the regularization strength increases. However, I am not able to reproduce this argument in the following example. Can someone explain what is incorrect or missing?

Suppose the original design matrix is $$A = \begin{pmatrix}1 & 2 \\ 1 & 2 \end{pmatrix},$$ so $\operatorname{rank}(A) = 1.$ Further, suppose the augmented design matrix is $$B = \begin{pmatrix}1 & 2 \\ 1 & 2 \\ \nu & 0 \\ 0 & \nu \end{pmatrix},$$ where $\nu^{2} = \lambda$ is the regularization strength, so $\operatorname{rank}(B) = 2.$ Then, the columns of $B$ are linearly independent, whereas the columns of $A$ are linearly dependent.

Now, the inner product is the standard inner product on $\mathbb{R}^{4},$ so we may take the transpose of the first column of $B$ times the second column of $B$ to obtain: $$\begin{pmatrix} 1 & 1 & \nu & 0\end{pmatrix} \begin{pmatrix} 2 \\ 2 \\ 0 \\ \nu \end{pmatrix} = 4.$$ However, this inner product is nonzero, so the columns are not pairwise orthogonal.
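A quick numerical check in R (a minimal sketch using the columns of $B$ above) confirms that this inner product stays at $4$ however large $\nu$ is:

```r
# Columns of the augmented matrix B; nu is the square root of the
# regularization strength lambda.
for (nu in c(0, 1, 10, 1e6)) {
  b1 <- c(1, 1, nu, 0)
  b2 <- c(2, 2, 0, nu)
  cat("nu =", nu, " b1 . b2 =", sum(b1 * b2), "\n")  # always 4
}
```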

sunspots
  • Hint: when $\nu=10^{20},$ use your favorite statistical computing software to measure the angle between the two columns of $B.$ What is it? – whuber Jun 03 '21 at 21:17
  • @whuber An angle of $\frac{\pi}{2}$ is not the same as orthogonal. The latter means a bilinear form vanishes; see https://en.wikipedia.org/wiki/Orthogonality. However, the normalized dot product is not a bilinear form; see my remarks below Alex's solution. – sunspots Jun 04 '21 at 00:52
  • You are confused, because the context is that of *approaching* a right angle. You therefore need a concept of *nearness* to orthogonality. That is afforded by the measure of the angle. – whuber Jun 04 '21 at 02:35
  • The confusion is your abuse of the word orthogonal. The definition of orthogonal is given in the aforementioned wiki link, and it is clearly in terms of a bilinear form. Instead, one should say the similarity between the columns vanishes with increasing regularization strength; https://en.wikipedia.org/wiki/Cosine_similarity. – sunspots Jun 04 '21 at 02:44
  • Referring only to wikipedia for information about conventional uses of mathematical terms is too limited. In the context in which I originally used the term "orthogonal" its meaning is clear and unambiguous. – whuber Jun 04 '21 at 11:35
  • Orthogonality requires linearity, however; what you described without reference is nonlinear. Here are some references: http://people.math.harvard.edu/~mjh/northwestern.pdf, chapter 6, section 2 of Halmos's book https://download.tuxfamily.org/openmathdep/algebra_linear/Finite_Vector_Spaces-Halmos.pdf, chapter 6, section 1 in Linear Algebra by Friedberg et al., etc. – sunspots Jun 04 '21 at 12:32
  • I view your continued comments to be just trolling, because they insist on not understanding any of the material either of us have referenced. Thus, this is the end of the conversation for me. – whuber Jun 04 '21 at 13:39
  • Wikipedia wasn't sufficient, so I shared some nice resources on orthogonality. What is the matter with sharing? I mainly leave these resources, not for you, but for anyone who wishes to understand this problem in a rigorous way. – sunspots Jun 05 '21 at 02:21
  • Your claim of "nonlinearity" is specious. Anyone who wishes to know how the term "orthogonal" is used in this context from a rigorous modern standpoint can visit https://stats.stackexchange.com/a/66295/919. – whuber Jun 06 '21 at 13:44
  • Arccosine is not linear; see its Taylor series: https://proofwiki.org/wiki/Power_Series_Expansion_for_Real_Arccosine_Function. What you link to can be found in chapter 6, section 3 of Linear Algebra by Friedberg et al. (https://anujitspenjoymath.files.wordpress.com/2018/08/stephen_h-_friedberg_2c_arnold_j-_insel.pdf), where the derivation is done in terms of the adjoint, so it holds for the fields $\mathbb{R}$ or $\mathbb{C}.$ – sunspots Jun 07 '21 at 00:52
  • As for the augmented design matrix, the idea begins with virtual examples, as introduced by Yaser Abu-Mostafa: https://direct.mit.edu/neco/article/7/4/639/5886/Hints. Then, it continues with exercise 3.12 in The Elements of Statistical Learning by Hastie et al.: https://web.stanford.edu/~hastie/Papers/ESLII.pdf. – sunspots Jun 07 '21 at 00:52

2 Answers


Take $A^TA$:

$$A^TA = \begin{pmatrix}1 & 1 \\ 2 & 2 \end{pmatrix}\begin{pmatrix}1 & 2 \\ 1 & 2 \end{pmatrix}=\begin{pmatrix}2 & 4 \\ 4 & 8 \end{pmatrix},$$

and compare with $B^TB$:

$$B^TB = \begin{pmatrix}1 & 2 \\ 1 & 2 \\ \nu & 0 \\ 0 & \nu \end{pmatrix}^T\begin{pmatrix}1 & 2 \\ 1 & 2 \\ \nu & 0 \\ 0 & \nu \end{pmatrix} = \begin{pmatrix} 1 & 1 & \nu & 0 \\ 2 & 2 & 0 & \nu \end{pmatrix}\begin{pmatrix}1 & 2 \\ 1 & 2 \\ \nu & 0 \\ 0 & \nu \end{pmatrix} = \begin{pmatrix}2 + \nu^2 & 4 \\ 4 & 8 + \nu^2 \end{pmatrix}$$

The bigger $\nu$ is, the more $B^TB$ resembles a (scaled) identity matrix.
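To see this numerically, here is a minimal R sketch with the same $B$ (the choice $\nu = 10^3$ is just for illustration): after dividing by $\nu^2$, the off-diagonal entry $4$ becomes negligible.

```r
nu <- 1e3
B  <- rbind(c(1, 2), c(1, 2), c(nu, 0), c(0, nu))  # augmented design matrix
G  <- crossprod(B)   # t(B) %*% B = [2 + nu^2, 4; 4, 8 + nu^2]
print(G)
print(G / nu^2)      # close to the 2x2 identity for large nu
```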

Firebug
  • How do the columns of the Gramian matrix $B^{t}B,$ which are elements of $\mathbb{R}^{2},$ relate to the columns of $B,$ which are elements of $\mathbb{R}^{4}?$ – sunspots Jun 02 '21 at 13:16
  • +1. With your calculation in hand, we can even quantify how close $B^\prime B$ comes to a multiple of the identity, because (obviously) $$B^\prime B = \nu^2\left[\pmatrix{1&0\\0&1} + O(\nu^{-2})\right].$$ – whuber Jun 03 '21 at 21:11

The claim in whuber's answer that the vectors are becoming "more orthogonal" is ambiguous. I would take it to mean that the correlation is getting closer to $0$ as $\nu$ gets bigger. In $A$, the correlation of the columns is $1$. In $B$, the centered correlation is given by $$\frac{-\nu^2/4 - 3\nu/2 + 2}{\sqrt{(3\nu^2 / 4 - \nu + 1)(3\nu^2/4 - 2\nu + 4)}},$$ which approaches $-1/3$ from below as $\nu \rightarrow \infty$.
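As a numerical check (a minimal R sketch; the helper `centered_cor` is just the expression above, and $\nu = 10^4$ is only for illustration), this formula agrees with R's `cor`:

```r
centered_cor <- function(nu) {
  (-nu^2 / 4 - 3 * nu / 2 + 2) /
    sqrt((3 * nu^2 / 4 - nu + 1) * (3 * nu^2 / 4 - 2 * nu + 4))
}
nu <- 1e4
c(formula = centered_cor(nu),
  cor     = cor(c(1, 1, nu, 0), c(2, 2, 0, nu)))  # both near -1/3
```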

Using the definition that the orthogonality of columns $c_1, c_2$ is $$\frac{c_1 \cdot c_2}{\sqrt{|c_1|^2 |c_2|^2}},$$ we have that the orthogonality of the columns of $A$ is 1, and the orthogonality of the columns of $B$ is $$\frac{4}{\sqrt{(1^2 + 1^2 + \nu^2)(2^2 + 2^2 + \nu^2)}} = \frac{4}{\sqrt{(2 + \nu^2)(8 + \nu^2)}},$$ which decreases from $1$ to $0$ as $\nu$ increases from $0$ to $\infty$.
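A minimal R sketch of this uncentered quantity (the helper `cosine` is illustrative, not from any package):

```r
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
for (nu in c(0, 1, 10, 1e3)) {
  cat("nu =", nu, " cosine =", cosine(c(1, 1, nu, 0), c(2, 2, 0, nu)), "\n")
}  # decreases from 1 toward 0
```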

Alex
  • I'm not sure that I follow, as I would expect a covariance (or correlation) matrix for either $A$ or $B.$ For $A,$ either matrix is in $M_{2 \times 2} (\mathbb{R}),$ and for $B,$ either matrix is in $M_{4 \times 4} (\mathbb{R}).$ – sunspots Jun 02 '21 at 13:08
  • (1) Your calculation cannot possibly be correct for (B), because it is obvious that as $\nu$ grows the other terms can be treated as very small, whence the correlation must approach $0.$ (2) Re "ambiguous:" If orthogonality is not measured in terms of the normalized dot product (or its inverse cosine, which is the angle between the vectors), then what alternative do you have in mind? What measure of orthogonality would not agree with either of these insofar as it establishes a natural amount of orthogonality? – whuber Jun 02 '21 at 13:09
  • @sunspots In both cases there are two columns and therefore there is just one correlation coefficient for them. – whuber Jun 02 '21 at 13:10
  • @whuber I don't follow, see the solution by user1551 (there is an associated covariance or correlation matrix): https://math.stackexchange.com/questions/2624760/is-the-determinant-of-a-covariance-matrix-always-zero – sunspots Jun 02 '21 at 13:18
  • @sunspots The covariance matrix has three independent entries, but the correlation matrix merely contains a correlation coefficient. Since in either case there are only two columns, "$M_{4\times4}(\mathbb{R})$" is not a relevant space. It looks like you might be confusing rows with columns. – whuber Jun 02 '21 at 13:33
  • @whuber (1) Well my calculation matches what I obtain by trying large values in R: `cor(c(1, 1, 10^20, 0), c(2, 2, 0, 10^20))` gives `-0.3333333` as the output. (2) I agree that the normalised dot product is the natural measure of orthogonality, but I think the questioner assumed that you meant the dot product. – Alex Jun 02 '21 at 13:35
  • `cor` is not the appropriate function to be using for this calculation because it initially centers the vectors before computing the dot products. – whuber Jun 02 '21 at 13:39
  • @whuber my calculation is based on centering the data before computing the dot product. I agree that if you don't center it then the correlation goes to $0$. – Alex Jun 02 '21 at 13:43
  • Centering is not appropriate in this context. It could make sense if the design matrix for ridge regression were to contain a constant column: but by construction, it does not. – whuber Jun 02 '21 at 13:45
  • It seems the starting point is to let $A$ and $B$ be centered. Then, the covariance matrices are $A^{t}A$ and $B^{t}B,$ as computed by Firebug. Subsequently, the correlation matrices belong to $M_{2\times 2}(\mathbb{R}).$ Now, what you are referring to as the correlation is the $1,2$ or $2,1$ entry in these symmetric matrices. For the augmented case, the quantity is computed by Alex. However, this computation does not imply that the columns are orthogonal. – sunspots Jun 03 '21 at 01:50
  • This follows, since the normalized dot product is not an inner product. Namely, it does not satisfy all three of the properties that an inner product must satisfy. Instead, one could call the normalized dot product a similarity measure. In this case, we could conclude that the similarity goes to $0$ as $\nu$ tends towards infinity, as shown by Alex. – sunspots Jun 03 '21 at 01:50
  • To see that $\langle x, y\rangle = \frac{x \cdot y}{\sqrt{|x|^{2}|y|^{2}}}$ does not define an inner product, check whether $\langle ax, y\rangle = a\langle x, y\rangle,$ where $x,y$ are vectors and $a$ is a scalar. – sunspots Jun 03 '21 at 01:52