
This is more of a follow-up question regarding: Confused with Residual Sum of Squares and Total Sum of Squares.

Total sum of squares can be represented as:

$$\displaystyle \sum_i ({y}_i-\hat{y}_i)^2+2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y}) +\sum_i(\hat{y}_i-\bar{y})^2$$

Where:

  1. The 1st term is the residual sum of squares,
  2. the 2nd term is the covariance between the residuals and the predicted values, and
  3. the 3rd term is the explained sum of squares.
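The identity itself is purely algebraic (it's just $(y_i-\bar y)^2 = [(y_i-\hat y_i)+(\hat y_i-\bar y)]^2$ expanded), so it holds for any choice of $\hat y$. Here's a quick numeric sanity check with made-up data (just a sketch; the numbers and the numpy setup are mine):

```python
import numpy as np

# Made-up data and an arbitrary (not necessarily least-squares) prediction.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
y_hat = 0.5 * y + rng.normal(scale=0.3, size=50)
y_bar = y.mean()

tss   = np.sum((y - y_bar) ** 2)                   # total sum of squares
term1 = np.sum((y - y_hat) ** 2)                   # residual sum of squares
term2 = 2 * np.sum((y - y_hat) * (y_hat - y_bar))  # cross term
term3 = np.sum((y_hat - y_bar) ** 2)               # explained sum of squares

print(np.isclose(tss, term1 + term2 + term3))      # True: holds for any y_hat
```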

There are a few things I don't understand:

  1. Why would a correlation between residuals and predicted values mean there are better values of $\hat y$?
  2. Why is the second term a covariance? I've tried to work it out on paper, but I keep getting an extra division by $N$ (the number of data points).

$$2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y})=2\sum_i(y_i \hat y_i-\hat y_i^2 + \hat y_i \bar y - y_i \bar y)$$


\begin{align} \operatorname{cov}(X, Y) & = E[XY]-E[X]E[Y] \\ \operatorname{cov}(y_i-\hat y_i,\ \hat y_i) & = E[(y_i -\hat y_i)\hat y_i]-E[y_i-\hat y_i]E[\hat y_i] \\ E[\hat y_i] & = \bar y \text{ if perfect prediction} \\ & =E[(y_i-\hat y_i)\hat y_i]-E[y_i-\hat y_i]\bar y \\ & =E[(y_i-\hat y_i)\hat y_i]-E[\bar y(y_i-\hat y_i)] \\ & =E[y_i\hat y_i-\hat y_i^2]-E[y_i \bar y-\hat y_i \bar y] \\ & =E[y_i\hat y_i-\hat y_i^2]+E[-y_i \bar y+\hat y_i \bar y] \\ & =E[y_i\hat y_i-\hat y_i^2-y_i \bar y+\hat y_i \bar y] \\ & =\frac{\sum_i(y_i\hat y_i-\hat y_i^2-y_i \bar y+\hat y_i \bar y)}{N} \\ \end{align}

From the above computation, the covariance $\ne \displaystyle 2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y})$.
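To make the mismatch concrete, here's what I see numerically (a minimal sketch with made-up data; I deliberately pick a $\hat y$ whose mean equals $\bar y$ so the comparison is exact): the sum $\sum_i(y_i-\hat y_i)(\hat y_i-\bar y)$ comes out to exactly $N$ times the sample covariance, which is the extra $N$ I keep running into.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 40
x = rng.normal(size=N)
y = 2 * x + rng.normal(size=N)

# A deliberately non-least-squares prediction that still has mean(y_hat) == mean(y),
# so the residuals have mean zero and the comparison is exact.
y_hat = y.mean() + 0.7 * (x - x.mean())
e = y - y_hat

cross_sum  = np.sum(e * (y_hat - y.mean()))
sample_cov = np.cov(e, y_hat, bias=True)[0, 1]    # biased sample covariance, divides by N

print(np.isclose(cross_sum, N * sample_cov))      # True: the sum is N times the covariance
```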

Am I misinterpreting something, or doing something incorrectly?


In response to: "$H$ is a really important matrix and it's worth taking the time to understand it. First, note that it's symmetric (you can prove this by showing $H^T=H$). Then prove it's idempotent by showing $H^2=H$. This all means that $H$ is a projection matrix, and $H$ projects a vector $v \in \mathbb R^n$ into the $p$-dimensional subspace spanned by the columns of $X$. It turns out that $I-H$ is also a projection, and this projects a vector into the space orthogonal to the space that $H$ projects into."
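Before working through a $2 \times 2$ example by hand below, here's a quick numeric check of those two properties on a random design matrix with an intercept column (a sketch; the data, sizes, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors

H = X @ np.linalg.inv(X.T @ X) @ X.T                            # hat matrix

print(np.allclose(H, H.T))        # True: symmetric, H^T = H
print(np.allclose(H @ H, H))      # True: idempotent, H^2 = H
print(np.linalg.matrix_rank(H))   # 3: H projects onto the p-dimensional column space of X
```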

Let's assume $X$ is a $2 \times 2$ matrix:

$$ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \end{bmatrix} $$

Then $X^T$: $$ \begin{bmatrix} 1 & 1 \\ x_1 & x_2 \\ \end{bmatrix} $$


Compute $X^TX$

$ \begin{bmatrix} 1 & 1 \\ x_1 & x_2 \\ \end{bmatrix} $ $ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \end{bmatrix} $ $=$ $ \begin{bmatrix} 2 & x_1+x_2 \\ x_1+x_2 & x_1^2+x_2^2 \\ \end{bmatrix} $


Compute $(X^TX)^{-1}$

$ A = \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix} $ $ A^{-1} = \frac{1}{|A|} \begin{bmatrix} d & -b \\ -c & a \\ \end{bmatrix} $ $ A^{-1} = \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \\ \end{bmatrix} $

$ (X^TX)^{-1} = \begin{bmatrix} \frac{x_1^2+x_2^2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \end{bmatrix} $


Compute $X(X^TX)^{-1}$

$X(X^TX)^{-1} = $ $ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \end{bmatrix} $ $ \begin{bmatrix} \frac{x_1^2+x_2^2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \end{bmatrix} $ $= \begin{bmatrix} \frac{x^2_2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_1-x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \frac{x_1^2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_2-x_1}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \end{bmatrix} $


Compute $X(X^TX)^{-1}X^T$

$X(X^TX)^{-1}X^T = $ $ \begin{bmatrix} \frac{x^2_2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_1-x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \frac{x_1^2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_2-x_1}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\ \end{bmatrix} $ $ \begin{bmatrix} 1 & 1 \\ x_1 & x_2 \\ \end{bmatrix} $ $= \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} $

which is the identity matrix.
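Double-checking that hand computation symbolically (a sketch using sympy) gives the same result:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
X = sp.Matrix([[1, x1],
               [1, x2]])

# Hat matrix H = X (X^T X)^{-1} X^T, simplified elementwise.
H = (X * (X.T * X).inv() * X.T).applyfunc(sp.simplify)
print(H)   # Matrix([[1, 0], [0, 1]]) -- the identity, matching the hand computation above
```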

user1157751

1 Answer


I'm going to assume this is all in the context of a linear model $Y = X\beta + \varepsilon$. Letting $H = X(X^T X)^{-1}X^T$, we have fitted values $\hat Y = H Y$ and residuals $e = Y - \hat Y = (I - H)Y$. For the second term in your expression, $$ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \langle e, HY - \bar y \mathbb 1\rangle $$ (where $\mathbb 1$ is the vector of all $1$'s and $\langle \cdot, \cdot\rangle$ is the standard inner product) $$ = \langle (I-H)Y, HY - \bar y \mathbb 1\rangle = Y^T (I-H)HY - \bar y Y^T (I-H) \mathbb 1. $$ Assuming we have an intercept in our model, $\mathbb 1$ is in the span of the columns of $X$, so $(I-H)\mathbb 1 = 0$. We also know that $H$ is idempotent, so $(I-H)H = H-H^2 = H-H = 0$; therefore $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = 0$.

This tells us that the residuals are necessarily uncorrelated with the fitted values. This makes sense because the fitted values are the projection of $Y$ into the column space, while the residuals are the projection of $Y$ into the space orthogonal to the column space of $X$. These two vectors are necessarily orthogonal, i.e. uncorrelated.

By showing that, under this model, $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = 0$, we have proved that $$ \sum_i(y_i - \bar y)^2 = \sum_i(y_i - \hat y_i)^2 + \sum_i(\hat y_i - \bar y)^2 $$ which is a well-known decomposition.
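Here's a quick numerical illustration of both facts (a minimal sketch with simulated data; `np.linalg.lstsq` is just one convenient way to get the least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares fit
y_hat = X @ beta_hat
e = y - y_hat
y_bar = y.mean()

print(np.isclose(np.sum(e * (y_hat - y_bar)), 0.0))       # cross term vanishes
print(np.isclose(np.sum((y - y_bar) ** 2),
                 np.sum(e ** 2) + np.sum((y_hat - y_bar) ** 2)))  # TSS = RSS + ESS
```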

To answer your question about why correlation between $e$ and $\hat Y$ means there are better values possible, I think you really need to consider the geometric picture of linear regression as shown below, for example:

[Figure: geometric picture of linear regression (taken from random_guy's answer here).]

If we have two centered vectors $a$ and $b$, the (sample) correlation between them is $$ \operatorname{cor}(a, b) = \frac{\sum_i a_ib_i}{\sqrt{\sum_i a_i^2 \sum_i b_i^2}} = \cos \theta $$ where $\theta$ is the angle between them. If this is new to you, you can read more about it here.
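As a quick numerical check of that identity (a sketch with two arbitrary made-up vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=30)
b = rng.normal(size=30)
a, b = a - a.mean(), b - b.mean()       # center both vectors

corr = np.corrcoef(a, b)[0, 1]                                  # sample correlation
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))     # cosine of the angle

print(np.isclose(corr, cos_theta))      # True: correlation of centered vectors = cos(theta)
```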

Linear regression by definition seeks to minimize $\sum_i e_i^2$. Looking at the picture, we can see that this is the squared length of the vector $\hat \varepsilon$, and we know that this length will be the shortest when the angle between $\hat \varepsilon$ and $\hat Y$ is $90^\circ$ (if that's not clear, imagine moving the point given by the tip of the vector $\hat Y$ in the picture and see what happens to the length of $\hat \varepsilon$). Since $\cos 90^\circ = 0$, these two vectors are uncorrelated. If this angle is not $90^\circ$, i.e. $\sum_i e_i \hat y_i \neq 0 \implies \cos \theta \neq 0$, then we don't have the $\hat Y$ that's as close as possible.

To answer your question about how the term $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y)$ is a covariance, you need to remember that this is a sample covariance, not the covariance between random variables. As I showed above, that sum is always $0$ under this model. Note that $$ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_i ([y_i - \hat y_i] - 0)([\hat y_i] - \bar y). $$ Noting that the sample average of $y_i - \hat y_i$ is $0$ and the sample average of $\hat y_i$ is $\bar y$, we have that this is, by definition, $N$ times the sample covariance between the residuals and the fitted values, which accounts for the extra division by $N$ you were getting.
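To see those two sample averages concretely, here is a minimal sketch with simulated data (any least-squares fit with an intercept will do):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # model with an intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

print(np.isclose(e.mean(), 0.0))            # sample average of the residuals is 0
print(np.isclose(y_hat.mean(), y.mean()))   # sample average of the fitted values is y-bar
```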

jld
  • Is it possible not to use linear algebra notation? Thanks – user1157751 Mar 13 '17 at 21:35
  • @user1157751 sure, I'll see what I can do. – jld Mar 13 '17 at 21:37
  • @user1157751 ok I've added a substantial bit more. Let me know if this still isn't helpful. – jld Mar 13 '17 at 22:40
  • Thanks!! It's super helpful, I just need some time to digest. – user1157751 Mar 13 '17 at 22:52
  • Is your equation $H = X(X^T X)^{-1}X^T$ supposed to be $H = Y(X^T X)^{-1}X^T$? – user1157751 Mar 20 '17 at 21:02
  • @user1157751 no, we have $\hat Y = X \hat \beta$ where $\hat \beta = (X^T X)^{-1}X^T Y$ so $\hat Y = \left[ X (X^T X)^{-1}X^T\right]Y = HY$ for $H = X(X^T X)^{-1}X^T$. – jld Mar 20 '17 at 21:04
  • Thanks! If you have time, can you explain how it was derived to such form? $ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \langle e, HY - \bar y \mathbb 1\rangle $ – user1157751 Mar 21 '17 at 16:49
  • @user1157751 $\langle u, v \rangle = \sum_i u_iv_i$ so this is just a few steps of substitution. We know $e_i = y_i - \hat y_i$, and $\hat Y = H Y$. We need the vector with elements $\hat y_i - \bar y$ and we can get this by subtracting the vector $\bar y \mathbf 1 = (\bar y, \dots, \bar y)$ from $HY$. Putting this together, we have $$\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_i e_i ([HY]_i - \bar y) = \langle e, HY - \bar y \mathbf 1 \rangle$$ – jld Mar 21 '17 at 17:12
  • After drawing pictures of matrices, I was able to get that, thanks! One more question if you don't mind: "Assuming we have an intercept in our model, 1 is in the span of the columns of $X$ so $(I−H)1=0$" how did you get this conclusion? Furthermore how did you know $H$ is idempotent? Trying to understand all of this is stretching my math ability, but I think I'm getting there. – user1157751 Mar 21 '17 at 17:29
  • @user1157751 [part 1] $H$ is a really important matrix and it's worth taking the time to understand it. First, note that it's symmetric (you can prove this by showing $H^T = H$). Then prove it's idempotent by showing $H^2 = H$. This all means that $H$ is a projection matrix, and $H$ projects a vector $v \in \mathbb R^n$ into the $p$-dimensional subspace spanned by the columns of $X$. It turns out that $I- H$ is also a projection, and this projects a vector into the space orthogonal to the space that $H$ projects into. – jld Mar 21 '17 at 17:46
  • [part 2] Note how $v = Iv = (I - H + H)v = (I - H)v + Hv$, so we can decompose any vector $v$ into the part in the column space of $X$ and the part orthogonal to the column space of $X$. If $v$ is already in the column space of $X$ then $Hv = v$ and $(I - H)v = v - v = 0$. So if there's an intercept, represented by a constant column in $X$, then $\mathbb 1$ is in the column space of $X$ so $H \mathbb 1 = \mathbb 1$ and $(I - H)\mathbb 1 = 0$ – jld Mar 21 '17 at 17:47
  • Oops, I accidentally deleted the comment; let me rephrase it a bit: why isn't $H$ itself an identity matrix? Since $AA^{-1}=I$, and we have $H=X^TX(X^TX)^{-1}$, but then $\hat Y = Y$, since $\hat Y = HY$. – user1157751 Mar 21 '17 at 21:53
  • $H = X(X^T X)^{-1}X^T$, not $X^T X(X^T X)^{-1}$. Matrix multiplication is not in general commutative so these two quantities are not the same (and the second one is indeed $I$). – jld Mar 21 '17 at 21:56
  • Hmm... Somehow I thought it was commutative, but thinking about it twice, it doesn't make sense if it's commutative. No wonder I have been trying to solve it for 2 Weeks+. Thanks a bunch. – user1157751 Mar 21 '17 at 22:01
  • Matrices are linear transformations so something like $ABv$ is like function composition $f(g(v))$. With functions we would never expect $f(g(v)) = g(f(v))$ for arbitrary $f$ and $g$. In general, it's remarkable when things do commute rather than being weird when they don't. Glad this helped! – jld Mar 21 '17 at 22:07
  • I tried to compute a 2x2 $H$ matrix; however, the result does not look like a projection matrix? The results are edited into the question. – user1157751 Mar 23 '17 at 17:50
  • @user1157751 I haven't checked your exact algebra but there's definitely some cancelation that you could do. Either way, though, I would never actually compute $H$ and look at the numbers in it. I don't know what the numbers in a general projection matrix should look like. Just based on its construction we know that $H$ is a projection matrix and what space it projects to. We know its rank and its spectrum, all without actually computing it for a particular $X$. I would try to understand it through that lens, agnostic to $X$, rather than by actually computing it – jld Mar 23 '17 at 18:09
  • After redoing the equations, it looks like $H$ is an identity matrix, which is symmetric, and $H^2=H$, but it also means that $\hat Y = HY$, then $\hat Y = IY$, so $\hat Y = Y$. Doesn't sound correct? – user1157751 Mar 23 '17 at 22:01
  • @user1157751 if you've still got questions about this, I think it'd be worth asking a new question. These are all good questions but it's pretty far from the original question that we're commenting on – jld Mar 27 '17 at 16:40
  • Yeah, I will probably do that next. I'm not sure what a projection matrix is, but still working on it. – user1157751 Mar 27 '17 at 16:42
  • You've been really helpful to me, really appreciated it!! Many thanks. – user1157751 Mar 27 '17 at 16:59