
I would like to understand why, under the OLS model, the RSS (residual sum of squares) is distributed $$\chi^2\cdot (n-p)$$ ($p$ being the number of parameters in the model, $n$ the number of observations).

I apologize for asking such a basic question, but I can't seem to find the answer online (or in my more application-oriented textbooks).

Mooncrater
Tal Galili
  • Note that the answers demonstrate the assertion is not quite right: the distribution of RSS is $\sigma^2$ (not $n-p$) times a $\chi^2(n-p)$ distribution, where $\sigma^2$ is the true variance of the errors. – whuber Nov 21 '13 at 19:40

3 Answers


I consider the following linear model: ${y} = X \beta + \epsilon$, where $X$ is an $n \times p$ design matrix of full column rank and $\epsilon \sim N(0, \sigma^2 I_n)$.

The vector of residuals is estimated by

$$\hat{\epsilon} = y - X \hat{\beta} = (I - X (X'X)^{-1} X') y = Q y = Q (X \beta + \epsilon) = Q \epsilon$$

where $Q = I - X (X'X)^{-1} X'$; the last equality uses $QX = 0$.

Observe that $\textrm{tr}(Q) = n - p$ (the trace is invariant under cyclic permutations) and that $Q'=Q=Q^2$, i.e. $Q$ is symmetric and idempotent. Its eigenvalues are therefore $0$ and $1$ (some details below). Hence, since $Q$ is normal (a matrix is diagonalizable by a unitary matrix if and only if it is normal), there exists a unitary matrix $V$ such that

$$V'QV = \Delta = \textrm{diag}(\underbrace{1, \ldots, 1}_{n-p \textrm{ times}}, \underbrace{0, \ldots, 0}_{p \textrm{ times}})$$
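
These properties of $Q$ are easy to check numerically. Below is a minimal sketch in Python/NumPy (the design matrix $X$ is just an arbitrary simulated full-rank matrix, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))                        # arbitrary simulated design matrix
    Q = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # Q = I - X (X'X)^{-1} X'

    print(np.trace(Q))                                  # ~ 47, i.e. n - p
    print(np.allclose(Q, Q.T), np.allclose(Q, Q @ Q))   # symmetric and idempotent
    eig = np.linalg.eigvalsh(Q)
    print(np.sum(np.isclose(eig, 1.0)), np.sum(np.isclose(eig, 0.0)))  # 47 ones, 3 zeros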

Now, let $K = V' \hat{\epsilon}$.

Since $\hat{\epsilon} \sim N(0, \sigma^2 Q)$, we have $K \sim N(0, \sigma^2 \Delta)$ and therefore $K_{n-p+1}=\ldots=K_n=0$ (almost surely, since these components have zero mean and zero variance). Thus

$$\frac{\|K\|^2}{\sigma^2} = \frac{\|K^{\star}\|^2}{\sigma^2} \sim \chi^2_{n-p}$$

with $K^{\star} = (K_1, \ldots, K_{n-p})'$.

Further, as $V$ is a unitary matrix, we also have

$$\|\hat{\epsilon}\|^2 = \|K\|^2=\|K^{\star}\|^2$$

Thus

$$\frac{\textrm{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$

Finally, observe that this result implies that

$$E\left(\frac{\textrm{RSS}}{n-p}\right) = \sigma^2$$
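
For readers who like to see this empirically, here is a small Monte Carlo sketch in Python/NumPy (the design matrix, $\beta$ and $\sigma$ are all made up for the illustration); it checks both the $\chi^2_{n-p}$ distribution of $\textrm{RSS}/\sigma^2$ and the unbiasedness of $\textrm{RSS}/(n-p)$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, sigma = 50, 3, 2.0
    X = rng.normal(size=(n, p))               # simulated design, held fixed across replications
    beta = np.array([1.0, -0.5, 0.3])         # made-up true coefficients

    rss = np.empty(20000)
    for i in range(rss.size):
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        rss[i] = resid @ resid                # residual sum of squares

    z = rss / sigma**2
    print(z.mean(), z.var())                  # ~ n - p = 47 and ~ 2(n - p) = 94
    print((rss / (n - p)).mean())             # ~ sigma^2 = 4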


Since $Q^2 - Q = 0$, the minimal polynomial of $Q$ divides the polynomial $z^2 - z$. So, the eigenvalues of $Q$ lie in $\{0, 1\}$. Since $\textrm{tr}(Q) = n-p$ is also the sum of the eigenvalues counted with multiplicity, $1$ must be an eigenvalue with multiplicity $n-p$ and $0$ an eigenvalue with multiplicity $p$.

ocram
  • (+1) Good answer. One can restrict attention to orthogonal, instead of unitary, $V$ since $Q$ is real and symmetric. Also, what is $\mathrm{SCR}$? I do not see it defined. By slightly rejiggering the argument, one can also avoid the use of a degenerate normal, in case that causes some consternation to those not familiar with it. – cardinal Dec 25 '11 at 17:29
  • @Cardinal. Good point. SCR ('Somme des Carrés Résiduels' in French) should have been RSS. – ocram Dec 25 '11 at 17:53
  • Thank you for the detailed answer Ocram! Some steps will require me to look more, but I have an outline to think about now - thanks! – Tal Galili Dec 25 '11 at 21:45
  • @Glen_b: Oh, I made an edit a couple of days ago to change SCR to SRR. I didn't remember that SCR is mentioned in my comment. Sorry for the confusion. – ocram Nov 18 '13 at 06:03
  • @Glen_b: It was supposed to mean RSS :-S Edited again. Thx – ocram Nov 18 '13 at 06:15
  • This assumes that $\varepsilon$ are normal and that $X$ are not random. There are whole fields of statistics where these assumptions do not hold ever. – mpiktas Nov 21 '13 at 19:51
  • Why is $tr(Q) = n-p$? And why is $K \sim N(0, \sigma^2 \Delta)$ if $K = V' \hat{\epsilon}$? What are $K_{n-p+1}$ and the other $K_i$'s? – Francesco Boi Aug 22 '19 at 15:01

IMHO, the matrix notation $Y=X\beta+\epsilon$ complicates things. Pure vector space language is cleaner. The model can be written $\boxed{Y=\mu + \sigma G}$ where $G$ has the standard normal distribution on $\mathbb{R}^n$ and $\mu$ is assumed to belong to a vector subspace $W \subset \mathbb{R}^n$.

Now the language of elementary geometry comes into play. The least-squares estimator $\hat\mu$ of $\mu$ is nothing but $P_WY$: the orthogonal projection of the observable $Y$ on the space $W$ to which $\mu$ is assumed to belong. The vector of residuals is $P^\perp_WY$: the projection of $Y$ on the orthogonal complement $W^\perp$ of $W$ in $\mathbb{R}^n$. The dimension of $W^\perp$ is $\dim(W^\perp)=n-\dim(W)$.

Finally, $$P^\perp_WY = P^\perp_W(\mu + \sigma G) = 0 + \sigma P^\perp_WG,$$ and $P^\perp_WG$ has the standard normal distribution on $W^\perp$, hence its squared norm has the $\chi^2$ distribution with $\dim(W^\perp)$ degrees of freedom. In particular, $\textrm{RSS} = \|P^\perp_WY\|^2 = \sigma^2 \|P^\perp_WG\|^2$, so $\textrm{RSS}/\sigma^2 \sim \chi^2_{n-\dim(W)}$.
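
The projection argument can also be checked by simulation; here is a minimal Python/NumPy sketch, where $W$ is taken (purely for illustration) to be the column space of an arbitrary simulated matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k, sigma = 40, 4, 1.5
    B = rng.normal(size=(n, k))            # W = column space of B, so dim(W) = 4
    U, _ = np.linalg.qr(B)                 # orthonormal basis of W
    P_perp = np.eye(n) - U @ U.T           # orthogonal projection onto W^perp

    mu = B @ rng.normal(size=k)            # some mu in W
    vals = np.empty(20000)
    for i in range(vals.size):
        Y = mu + sigma * rng.normal(size=n)
        r = P_perp @ Y                     # = sigma * P_perp G, since P_perp mu = 0
        vals[i] = r @ r / sigma**2
    print(vals.mean(), vals.var())         # ~ n - dim(W) = 36 and ~ 2*36 = 72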

This demonstration uses only one theorem, actually a definition-theorem:

Definition and theorem. A random vector in $\mathbb{R}^n$ has the standard normal distribution on a vector subspace $U \subset \mathbb{R}^n$ if it takes its values in $U$ and its coordinates in one ($\iff$ in every) orthonormal basis of $U$ are independent one-dimensional standard normal random variables.

(From this definition-theorem, Cochran's theorem is so obvious that it is not worth stating.)
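
For what it is worth, the definition-theorem is easy to illustrate numerically: the coordinates of $P^\perp_W G$ in an orthonormal basis of $W^\perp$ should behave like independent standard normals. A sketch (the basis is obtained from a full QR decomposition, which is just one convenient choice):

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 40, 4
    B = rng.normal(size=(n, k))                   # W = column space of B
    Vfull, _ = np.linalg.qr(B, mode='complete')   # full n x n orthogonal matrix
    U_perp = Vfull[:, k:]                         # orthonormal basis of W^perp

    G = rng.normal(size=(20000, n))               # rows: standard normal vectors in R^n
    coords = G @ U_perp                           # coordinates of P_perp G in that basis
    print(np.allclose(coords.mean(axis=0), 0, atol=0.05))                       # ~ zero mean
    print(np.allclose(np.cov(coords, rowvar=False), np.eye(n - k), atol=0.05))  # ~ identity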

Stéphane Laurent

There is a more general result that underlies many instances of the chi-squared distribution.


Quadratic form $Z^TAZ$ with standard normal $Z$ and symmetric idempotent $A$

Lemma: If $A$ is a symmetric and idempotent $n\times n$ real matrix and $Z\sim N(0,I_n)$ is a random vector of $n$ independent standard normal variables, then $Z^TAZ$ has chi-squared($r$) distribution, $r$ being the trace of $A$.

Proof. Use the decomposition lemma (below) to find an $n\times r$ matrix $U$ with orthonormal columns such that $A=UU^T$ and $r$ is the trace of $A$. Consider $N:=U^TZ$. Then $N$ is a random vector of $r$ variables having multivariate normal distribution with mean vector $0$ and covariance matrix $U^TU=I_r$. It follows that $ Z^T AZ = Z^TUU^TZ=N^TN $ is the sum of squares of $r$ IID standard normal variables, so it has chi-squared($r$) distribution.
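
A quick simulation check of the lemma (a sketch in Python/NumPy; the idempotent matrix $A$ is built from an arbitrary simulated $U$, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n, r = 30, 5
    U, _ = np.linalg.qr(rng.normal(size=(n, r)))   # n x r with orthonormal columns
    A = U @ U.T                                    # symmetric idempotent, trace = r

    Z = rng.normal(size=(20000, n))                # each row: Z ~ N(0, I_n)
    q = np.sum((Z @ A) * Z, axis=1)                # quadratic forms Z^T A Z
    print(np.trace(A))                             # ~ 5
    print(q.mean(), q.var())                       # ~ r = 5 and ~ 2r = 10 (chi-squared(5) moments)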


Decomposition of symmetric idempotent matrix

Lemma: If $A$ is a symmetric and idempotent $n\times n$ real matrix, then $A=UU^T$ where $U$ is an $n\times r$ matrix with orthonormal columns, $r$ being the trace of $A$.

Proof. Since matrix $A$ is idempotent, its eigenvalues are zero and one, and the multiplicity of unit eigenvalues equals the rank $r$ of $A$, which in turn equals the trace of $A$. Apply the spectral theorem for symmetric matrices to write $A=UDU^T$ where $D$ is a diagonal matrix of the eigenvalues of $A$ and $U$ is an $n\times n$ orthogonal matrix whose columns are the corresponding eigenvectors. We can delete from $U$ the columns corresponding to zero eigenvalue, leaving an $n\times r$ matrix; $D$ then becomes the identity.
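
The construction in the proof can be mirrored numerically (a Python/NumPy sketch; the particular symmetric idempotent matrix used is arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    n, r = 8, 3
    V0, _ = np.linalg.qr(rng.normal(size=(n, r)))
    A = V0 @ V0.T                           # some symmetric idempotent matrix of rank 3

    w, V = np.linalg.eigh(A)                # spectral decomposition A = V diag(w) V^T
    U = V[:, np.isclose(w, 1.0)]            # keep eigenvectors with eigenvalue 1
    print(U.shape)                          # (8, 3): n x r, with r = trace(A)
    print(np.allclose(U.T @ U, np.eye(r)))  # orthonormal columns
    print(np.allclose(A, U @ U.T))          # A = U U^T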


In the present situation, for the linear model $y=X\beta +\epsilon$ with $X$ of full column rank $p$ and $\epsilon\sim N(0,\sigma^2 I_n)$, we establish that the residual vector $\hat\epsilon:=y-X\hat\beta$ can be written $\hat\epsilon=(I-H)\epsilon$, where the hat matrix $H:=X(X^TX)^{-1}X^T$ is idempotent and symmetric. The same is true for $I-H$, so $\operatorname{RSS}:=\hat\epsilon^T\hat\epsilon=\epsilon^T(I-H)\epsilon$. Applying the quadratic form lemma to $Z=\epsilon/\sigma\sim N(0,I_n)$ shows that $\operatorname{RSS}/\sigma^2$ has the chi-squared($r$) distribution, with $r$ the trace of $I-H$. Since the trace of the hat matrix equals the rank of $X$, conclude $r=\operatorname{tr}(I-H)=n-\operatorname{tr}(H)=n-p$.
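
A short numerical sanity check of these identities (a Python/NumPy sketch with a simulated design matrix and made-up $\beta$, $\sigma$):

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, sigma = 25, 4, 1.0
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    eps = sigma * rng.normal(size=n)
    y = X @ beta + eps

    H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate
    resid = y - X @ beta_hat

    print(np.allclose(resid, (np.eye(n) - H) @ eps))               # residuals = (I - H) eps
    print(np.isclose(resid @ resid, eps @ (np.eye(n) - H) @ eps))  # RSS = eps'(I - H) eps
    print(round(np.trace(np.eye(n) - H)))                          # n - p = 21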

grand_chat