Edit: I think my old answer is a bit inaccurate.
First of all, regarding my question: $(X^TX)^{-1}$ is obviously not a projection matrix, by the mere fact that it fails the idempotence condition $P^2 = P$.
Second, there seems to be a bit of confusion because of the standardizing/normalizing stuff. If I regress on only 1 covariate w/o intercept, I get $\hat\beta_j=(x_j^Tx_j)^{-1}x_j^Ty$. If I do this to all the covariates separately, I get what I called the "individual regression", i.e., in matrix form:
$$\hat\beta_{ind}=\begin{pmatrix} x_1^Tx_1 & \dots &0 \\
\vdots &\ddots & \vdots \\
0 & \dots & x_p^Tx_p
\end{pmatrix}^{-1}X^Ty$$
I.e., it's as if we are assuming that the covariance between the covariates is 0, i.e., that they are uncorrelated, which in reality, of course, is not true.
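To make this concrete, here's a quick numpy sketch (made-up random data, no intercept) checking that the diagonal-matrix formula for $\hat\beta_{ind}$ matches regressing $y$ on each covariate separately:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))   # made-up design, no intercept
y = rng.standard_normal(n)

# Individual regressions: each slope fitted on its own.
beta_ind_loop = np.array([(x @ y) / (x @ x) for x in X.T])

# The same thing in matrix form: diag(x_j^T x_j)^{-1} X^T y.
D_inv = np.diag(1.0 / (X * X).sum(axis=0))
beta_ind_matrix = D_inv @ X.T @ y

print(np.allclose(beta_ind_loop, beta_ind_matrix))  # True
```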
Compare this to the full regression which doesn't assume this:
$$\hat\beta=\begin{pmatrix} x_1^Tx_1 & \dots &x_1^Tx_p \\
\vdots &\ddots & \vdots \\
x_p^Tx_1 & \dots & x_p^Tx_p
\end{pmatrix}^{-1}X^T y
$$
I'm not sure it's possible to break this down into some matrix times $\hat\beta_{ind}$...
In the case where we standardize the columns of $X$, this is possible: $\hat\beta_{ind}$ reduces to $\frac{1}{n}X^Ty$, and $\hat\beta$ can be written as $\left(\frac{1}{n}X^TX\right)^{-1}\hat\beta_{ind}$.
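A small numpy check of this relation on made-up correlated data, with the columns standardized so that $x_j^Tx_j = n$ for every $j$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
# Made-up correlated covariates, then standardized columns.
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # now x_j^T x_j = n for every j
y = rng.standard_normal(n)

beta_ind = X.T @ y / n                          # individual regressions
beta_full = np.linalg.solve(X.T @ X, X.T @ y)   # full regression

# Full regression recovered from the individual one:
beta_via_ind = np.linalg.solve(X.T @ X / n, beta_ind)
print(np.allclose(beta_full, beta_via_ind))  # True
```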
Old post:
So, this is what I think:
- $X^Ty$ gives the individual regressions if the columns of $X$ are normalized (unit norm). It is also the complete regression in an orthogonal design (i.e., if $X^TX=I$).
- $(X^TX)^{-1}$ is actually normalizing the $X$'s anyway, i.e., $(X^TX)^{-1}X^Ty$ will be normalized. You can see this clearly if you take the columns of $X$ to be orthogonal but not orthonormal: $X^TX$ will be a diagonal matrix, but without 1's on the diagonal. Taking the inverse of that and multiplying it by $X^Ty$, we again get the individual regression.
- This means that if $X$ has no correlation between the features and is normalized, then $X^Ty$ reveals the coefficients.
- If the features of $X$ have a positive correlation, then the columns of $X(X^TX)^{-1}$ have a negative correlation, and vice versa.
- I would expect that $(X^TX)^{-1}$ also serves to de-correlate the structure of the $X$'s to a new space $X^*=X(X^TX)^{-1}$, and in this new space we use individual regression to recover the coefficients.
- The thing that bothers me is why isn't ${X^*}^TX^*=I$? (Working it out, ${X^*}^TX^* = (X^TX)^{-1}X^TX(X^TX)^{-1} = (X^TX)^{-1}$.)
- Maybe it's a 2-way trip: $(X^TX)^{-1}X^T y$ goes to this new space, performs the individual regression there, and then comes back. Perhaps using the SVD we can see this?
$$X = UDV'\Rightarrow (X^TX)^{-1}X^T y = VD^{-1}U'y$$
where $U'y$ is the individual regression for the $U$'s, $D^{-1}$ is the normalization, and $V$ is the projection back?
- It is true that if you regress $y$ on $U$, the individual regression equals the regular regression, which is not so surprising given that the columns of $U$ are orthonormal.
So in the end, the difference between the component-wise regression, $\hat\beta_{ind} = VDU'y$, and the normal regression, $\hat\beta = VD^{-1}U'y$, is that the $D$ is inverted.
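These SVD identities are easy to verify numerically. A sketch on made-up data, checking $\hat\beta = VD^{-1}U'y$, $X^Ty = VDU'y$, and the claim that regressing $y$ on $U$ individually coincides with the full regression:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(d) V'

# Full OLS via the SVD: beta = V D^{-1} U' y
beta_svd = Vt.T @ (U.T @ y / d)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_svd, beta_ols))  # True

# X^T y = V D U' y
print(np.allclose(X.T @ y, Vt.T @ (d * (U.T @ y))))  # True

# Regressing y on U: individual = full regression, since U'U = I.
beta_U_full = np.linalg.lstsq(U, y, rcond=None)[0]
print(np.allclose(beta_U_full, U.T @ y))  # True
```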