
I'm afraid the answer is embarrassingly obvious, but here goes... I was playing with R, trying to get "giant" (Prof. Strang's word when explaining penalized regression) inverses of $A^\top A$ (*a-transpose-a*, the Gram matrix of the model matrix) in the presence of highly collinear regressors. I remembered the relationship of the inverse of $A^\top A$ to the variance of the parameter estimates - a direct relationship, $\text{Var} (\hat \beta) = \sigma^2 \left(A^\top A \right)^{-1},$ indicating that the high variance of the estimates in the presence of collinearity is related to high values in the inverse of the $A^\top A$ matrix. Of course this is addressed on the site:

If two or more columns of $A$ are highly correlated, one or more eigenvalue(s) of $A^\top A$ is close to zero and one or more eigenvalue(s) of $(A^\top A)^{−1}$ is very large.

Yet, to my surprise, it was $A^\top A,$ not $(A^\top A)^{-1},$ that turned out to have the huge eigenvalues.

The toy model tries to predict yearly income from income taxes paid and money spent on weekends, with all variables highly correlated:

$$\text{income} \sim \text{income taxes} + \text{money spent on weekends}$$

# The manufacturing of the toy dataset with 100 entries
weekend_expend = runif(100, 100, 2000)
income = weekend_expend * 100 + runif(100, 10000, 20000)
taxes = 0.4 * income + runif(100, 10000, 20000)
df = cbind(income, taxes, weekend_expend)
pairs(df)

[Scatterplot matrix from pairs(df): income, taxes, and weekend_expend are all strongly pairwise correlated]
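As a quick numerical complement to the plot (not part of the original post), `cor()` confirms that all pairwise correlations are close to 1:

# Pairwise correlations of the three simulated variables
round(cor(df), 3)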

> summary(mod <- lm(income ~ weekend_expend + taxes))

Call:
lm(formula = income ~ weekend_expend + taxes)

Residuals:
    Min      1Q  Median      3Q     Max 
-5337.7 -1885.9   165.8  2028.1  5474.6 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    5260.14790 1656.95983   3.175              0.00201 ** 
weekend_expend   81.55490    3.07497  26.522 < 0.0000000000000002 ***
taxes             0.46616    0.07543   6.180         0.0000000151 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2505 on 97 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981 
F-statistic: 2.551e+04 on 2 and 97 DF,  p-value: < 0.00000000000000022

> # The model matrix is of the form...
> head(A <- model.matrix(mod))
  (Intercept) weekend_expend    taxes
1           1      1803.8237 92743.93
2           1       441.6305 33697.32
3           1       379.0888 36401.24
4           1      1129.1074 65869.23
5           1       558.3715 36708.88
6           1      1790.5604 92750.60
>
> # And A transpose A is...
> (A_tr_A <- t(A) %*% A)
               (Intercept) weekend_expend        taxes
(Intercept)          100.0       113189.2      6632490
weekend_expend    113189.2    159871091.4   8788158840
taxes            6632489.5   8788158839.9 492672410430
>
> # ... with its inverse...
> (inv_A_tr_A <- solve(A_tr_A))
                  (Intercept)    weekend_expend               taxes
(Intercept)     0.43758617285  0.00072025324389 -0.0000187385886210
weekend_expend  0.00072025324  0.00000150703080 -0.0000000365782573
taxes          -0.00001873859 -0.00000003657826  0.0000000009067669
> 
> # The eigenvalues of A transpose A are...
> eigen(A_tr_A)$values
[1] 492829172338.305359      3109280.897155            2.285258
>
> "Huge" as compared to the eigenvalues of its transposed...
> eigen(inv_A_tr_A)$values
[1] 0.437587359169068602 0.000000321617773712 0.000000000002029101

The maximum eigenvalue of $A^\top A$ is $492829172338$ while for $(A^\top A)^{-1}$ we get eigenvalues as low as $0.000000000002029101.$

I was expecting the opposite to be the case: Much higher eigenvalues for the inverse of $A^\top A.$ So is this result spurious, or am I missing something critical?
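For reference, here is a small sanity check of the variance relationship quoted at the top, reusing `mod` and `inv_A_tr_A` from the session above (not in the original post):

# Var(beta_hat) = sigma^2 (A'A)^{-1}: the square roots of its diagonal
# reproduce the "Std. Error" column of summary(mod)
sigma2 <- summary(mod)$sigma^2
sqrt(diag(sigma2 * inv_A_tr_A))
sqrt(diag(vcov(mod)))   # same quantities, as computed internally by R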

Antoni Parellada
  • OK, cool. Saw the clip. I am pretty sure that Prof. Strang refers to the condition number when he says "*$A^TA$ has a giant inverse*", as he immediately qualifies this by saying "*the matrix $A$ is badly conditioned*". In both cases of the examples you created, the condition number (ratio of largest to smallest eigenvalue) is `~2e11`, which is rather large. Also note that if you perform `A`... – usεr11852 Jul 30 '20 at 23:13
  • Matrix $A$ is badly conditioned because of the condition number, not the sheer magnitude of the numbers (usually at least, unless the columns are at vastly different scales too, but that's another game). – usεr11852 Jul 30 '20 at 23:15
  • @usεr11852 Can you dumb it down a bit? I see the overall problem, but it seems key that it is the inverse of $A^\top A$ that is the matrix with high eigenvalues, so that the variance of the estimates turns out high... – Antoni Parellada Jul 30 '20 at 23:17
  • @usεr11852 ... although now it already makes a lot of sense... The "condition number"!!! Thank you very much, BTW! – Antoni Parellada Jul 30 '20 at 23:18
  • The man is an artist. You got to go with the flow. (+1 obviously to the question cause it's fun) – usεr11852 Jul 30 '20 at 23:19
  • The inverse of a matrix with eigenvalues $[3, 2, 1]$ has eigenvalues $[1/1, 1/2, 1/3]$. The condition number is still the same, $3$. Now I think the crux is that you say: "*high variance of the estimates in the presence of collinearity is related to high values in the inverse of the $A^TA$ matrix*", but the values are not the root issue of the high variance. It is the condition number, as it suggests that for a small change in the inputs (the explanatory variables) there will be a large change in the answer. – usεr11852 Jul 30 '20 at 23:22
  • @usεr11852 How can I explain to myself the quote above: If two or more columns of $A$ are highly correlated, one or more eigenvalue(s) of $A^\top A$ is close to zero and one or more eigenvalue(s) of $(A^\top A)^{−1}$ is very large. – Antoni Parellada Jul 30 '20 at 23:26
  • $A^TA$ will have an eigenvalue close to zero (i.e. an (orthogonal) dimension with very little variation) because the original matrix $A$ will have a very uninformative column: the column $x_1$ will be correlated with column $x_2$. That way, when we take the inner product of $A$ with itself, we end up with a degenerate kernel (a matrix, in this case) $A^TA$. Also, to see a near-zero eigenvalue, try `eigen(cov(A))`, which effectively centres the columns of $A$ first. – usεr11852 Jul 30 '20 at 23:34
  • @usεr11852 But $A^TA$ doesn't have an eigenvalue close to zero: $492829172338, 3109280, 2$ are the eigenvalues. What am I missing? – Antoni Parellada Jul 30 '20 at 23:39
  • Edited my answer on that. :) You can also try `svd(scale(A, scale = FALSE))$d` and see that just centring the values will do the trick. – usεr11852 Jul 30 '20 at 23:39
  • @usεr11852 Is $2$ considered the eigenvalue close to zero? – Antoni Parellada Jul 30 '20 at 23:41
  • @usεr11852 So basically we see this effect only when the mean of each column is centered at zero? Even so, the e-values of the inverse should be larger than the original... – Antoni Parellada Jul 30 '20 at 23:43
  • Yes. Also note that eigenvalues are "relative to each other". So, yes, if one is `492829172338` and the other is `2`, then `2` is "close to zero": realistically, if we normalised the large one to be at scale $1$, the other one would be at scale $10^{-12}$. – usεr11852 Jul 30 '20 at 23:44
  • @usεr11852 Thank you very much. What about the inverse having to have larger e-values than $A^\top A$? I guess we are trying to invert a matrix with a virtually zero e-value (after centering)... Makes sense... – Antoni Parellada Jul 30 '20 at 23:45
  • @usεr11852 It would be terrific to have all these concepts in a formal answer :-) – Antoni Parellada Jul 30 '20 at 23:48
  • Pleasure, Antoni! It is a bit late here and I need to do something else, but I can sort this out tomorrow. :) – usεr11852 Jul 30 '20 at 23:49

1 Answer


Specific to the linked video segment, Prof. Strang refers to the matrix condition number when he says "$A^TA$ has a giant inverse", as he immediately qualifies this by saying "the matrix $A$ is badly conditioned". Note that the condition number is the ratio of the largest to the smallest eigenvalue of $A^TA$, so the notion of a "small/large eigenvalue" is purely relative. In the example provided, the largest eigenvalue is $\lambda_1 = 492829172338$ and the smallest is $\lambda_3 \approx 2$; $\lambda_3$ is "close to zero" in the sense that if we normalised $\lambda_1$ to unit scale, $\lambda_3$ would be at scale $10^{-12}$.
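As an illustration (reusing `A_tr_A` from the question; `kappa()` is base R's condition-number function):

ev <- eigen(A_tr_A)$values
ev[1] / ev[length(ev)]        # ratio of largest to smallest eigenvalue, roughly 2e11
kappa(A_tr_A, exact = TRUE)   # base R's exact 2-norm condition number agrees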

Now regarding the inverse $(A^TA)^{-1}$: the condition number of a matrix $B$ and of its inverse $B^{-1}$ (given that $B^{-1}$ exists, of course) is the same. For example, if $B$ has eigenvalues $[3, 2, 1]$, then $B^{-1}$ has eigenvalues $[1/1, 1/2, 1/3]$; the condition number is still $3$. Cleve Moler's blog post What is the Condition Number of a Matrix? is an excellent conversational take on this. Notice that this relates directly to the statement "high variance of the estimates in the presence of collinearity is related to high values in the inverse of the $A^TA$ matrix": the high values are not, in themselves, the root issue of the high variance. It is the condition number, which tells us that a small change in the inputs (the data) can produce a large change in the solution, i.e. the estimated coefficients.
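A tiny toy check of this fact, independent of the regression data above:

B <- diag(c(3, 2, 1))           # eigenvalues 3, 2, 1
eigen(solve(B))$values          # 1, 1/2, 1/3
kappa(B, exact = TRUE)          # 3
kappa(solve(B), exact = TRUE)   # also 3: inversion does not change the conditioning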

Finally, regarding the side question "(Why,) if two or more columns of $A$ are highly correlated, (is) one or more eigenvalue(s) of $A^TA$ close to zero (...)?": as mentioned, this is because the original matrix has a very uninformative column (one column is, to a good approximation, just a rescaled version of another), and therefore the columns of $A$ are nearly linearly dependent. This column-space deficiency makes $A^TA$ what we call a degenerate (or near-singular) matrix. I started writing more on this, but I saw that ttnphns has given an absolute unit of an answer in the thread: What correlation makes a matrix singular and what are implications of singularity or near-singularity?.
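Following up on the comment thread above, here is how to see the (near-)zero eigenvalue directly, reusing `A` from the question. (Centring turns the intercept column into all zeros, so one singular value is numerically zero; the remaining two are still far apart because `weekend_expend` and `taxes` are strongly correlated.)

# Singular values of the centred model matrix
svd(scale(A, scale = FALSE))$d

# Equivalently, the covariance matrix of A has eigenvalue(s) at or near zero
eigen(cov(A))$values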

usεr11852