
Background

Suppose we have an ordinary least squares (OLS) regression model with $k$ coefficients, $$\mathbf{y}=\mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}$$

where $\mathbf{\beta}$ is a $(k\times1)$ vector of coefficients, $\mathbf{X}$ is the design matrix defined by

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1\,(k-1)} \\ 1 & x_{21} & x_{22} & \dots & x_{2\,(k-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{n\,(k-1)} \end{pmatrix}$$ and the errors are IID normal, $$\mathbf{\epsilon} \sim \mathcal{N}\left(\mathbf{0},\sigma^2 \mathbf{I}\right) \;.$$

We minimize the sum of squared errors by setting our estimate of $\mathbf{\beta}$ to $$\mathbf{\hat{\beta}}= (\mathbf{X^T X})^{-1}\mathbf{X}^T \mathbf{y}\;. $$

An unbiased estimator of $\sigma^2$ is $$s^2 = \frac{\left\Vert \mathbf{y}-\mathbf{\hat{y}}\right\Vert ^2}{n-k}$$ where $\mathbf{\hat{y}} \equiv \mathbf{X} \mathbf{\hat{\beta}}$ (ref).

The covariance of $\mathbf{\hat{\beta}}$ is given by $$\operatorname{Cov}\left(\mathbf{\hat{\beta}}\right) = \sigma^2 \mathbf{C}$$ where $\mathbf{C}\equiv(\mathbf{X}^T\mathbf{X})^{-1}$ (ref).
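For concreteness, here is a minimal NumPy sketch of these quantities on simulated data (the dimensions, coefficient values, and variable names are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3                        # n observations, k coefficients (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # design matrix
beta = np.array([1.0, 2.0, -0.5])    # true coefficients (arbitrary choice)
sigma = 1.5
y = X @ beta + rng.normal(scale=sigma, size=n)

C = np.linalg.inv(X.T @ X)           # C = (X^T X)^{-1}
beta_hat = C @ X.T @ y               # OLS estimate of beta
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)         # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(C))        # standard errors s * sqrt(c_ii)
```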

Question

How can I prove that, for each $\hat\beta_i$, $$\frac{\hat{\beta}_i - \beta_i} {s_{\hat{\beta}_i}} \sim t_{n-k}\,,$$ where $t_{n-k}$ is a t-distribution with $(n-k)$ degrees of freedom and the standard error of $\hat{\beta}_i$ is estimated by $s_{\hat{\beta}_i} = s\sqrt{c_{ii}}$, with $c_{ii}$ the $i^\text{th}$ diagonal element of $\mathbf{C}$?


My attempts

I know that for a sample of $n$ random variables $x\sim\mathcal{N}\left(\mu, \sigma^2\right)$, you can show that $$\frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1} $$ by rewriting the LHS as $$\frac{ \left(\frac{\bar x - \mu}{\sigma/\sqrt{n}}\right) } {\sqrt{s^2/\sigma^2}}$$ and realizing that the numerator is standard normal, and the denominator is the square root of a chi-square random variable with $(n-1)$ degrees of freedom divided by $(n-1)$ (ref), so the ratio follows a t-distribution with $(n-1)$ degrees of freedom (ref).
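As a quick Monte Carlo sanity check of that one-sample result (a simulation sketch of my own, not part of any proof), the simulated statistics match the $t_{n-1}$ quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 10, 50_000
x = rng.normal(mu, sigma, size=(reps, n))
t_stats = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

qs = [0.025, 0.5, 0.975]
print(np.quantile(t_stats, qs))   # empirical quantiles of the statistic
print(stats.t.ppf(qs, df=n - 1))  # theoretical t_{n-1} quantiles
```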

I was unable to extend this proof to my question...

Any ideas? I'm aware of this question, but they don't explicitly prove it; they just give a rule of thumb, saying "each predictor costs you a degree of freedom".

Garrett
  • Because $\hat\beta_i$ is a linear combination of jointly Normal variables, it has a Normal distribution. Therefore *all* you need do are (1) establish that $\mathbb{E}(\hat\beta_i)=\beta_i$; (2) show that $s_{\hat\beta_i}^2$ is an unbiased estimator of $\text{Var}(\hat\beta_i)$; and (3) demonstrate the degrees of freedom in $s_{\hat\beta_i}$ is $n-k$. The latter has been proven on this site in several places, such as http://stats.stackexchange.com/a/16931. I suspect you already know how to do (1) and (2). – whuber Oct 01 '14 at 15:27

1 Answer


Since $$\begin{align*} \hat\beta &= (X^TX)^{-1}X^TY \\ &= (X^TX)^{-1}X^T(X\beta + \varepsilon) \\ &= \beta + (X^TX)^{-1}X^T\varepsilon \end{align*}$$ we know that $$\hat\beta-\beta \sim \mathcal{N}(0,\sigma^2 (X^TX)^{-1})$$ and thus we know that for each component $k$ of $\hat\beta$, $$\hat\beta_k -\beta_k \sim \mathcal{N}(0, \sigma^2 S_{kk})$$ where $S_{kk}$ is the $k^\text{th}$ diagonal element of $(X^TX)^{-1}$. Thus, we know that $$z_k = \frac{\hat\beta_k -\beta_k}{\sqrt{\sigma^2 S_{kk}}} \sim \mathcal{N}(0,1).$$
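Spelling out the covariance step: since $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ and $(X^TX)^{-1}X^T$ is a fixed (nonrandom) matrix, $$\begin{align*} \operatorname{Cov}\left(\hat\beta - \beta\right) &= (X^TX)^{-1}X^T \operatorname{Cov}(\varepsilon)\, X(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1}. \end{align*}$$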

Take note of the statement of the Theorem for the Distribution of an Idempotent Quadratic Form in a Standard Normal Vector (Theorem B.8 in Greene):

If $x\sim\mathcal{N}(0,I)$ and $A$ is symmetric and idempotent, then $x^TAx$ is distributed $\chi^2_{\nu}$ where $\nu$ is the rank of $A$.

Let $\hat\varepsilon$ denote the regression residual vector and let $$M=I_n - X(X^TX)^{-1}X^T \text{,}$$ which is the residual maker matrix (i.e. $My=\hat\varepsilon$). It's easy to verify that $M$ is symmetric and idempotent.
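To spell out the verification: $$\begin{align*} M^T &= I_n - \left(X(X^TX)^{-1}X^T\right)^T = I_n - X(X^TX)^{-1}X^T = M \\ M^2 &= I_n - 2\,X(X^TX)^{-1}X^T + X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = M \\ MX &= X - X(X^TX)^{-1}X^TX = 0 \;. \end{align*}$$ The identity $MX = 0$ also gives $\hat\varepsilon = My = M(X\beta + \varepsilon) = M\varepsilon$, which is used below.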

Let $$s^2 = \frac{\hat\varepsilon^T \hat\varepsilon}{n-p}$$ be an estimator for $\sigma^2$, where $p$ denotes the number of columns of $X$ (the $k$ of the question).

We then need a bit of linear algebra. Note these three properties:

  • The rank of an idempotent matrix is its trace.
  • $\operatorname{Tr}(A_1+A_2) = \operatorname{Tr}(A_1) + \operatorname{Tr}(A_2)$
  • $\operatorname{Tr}(A_1A_2) = \operatorname{Tr}(A_2A_1)$ if $A_1$ is $n_1 \times n_2$ and $A_2$ is $n_2 \times n_1$ (this property is critical for the below to work)

So $$\begin{align*} \operatorname{rank}(M) = \operatorname{Tr}(M) &= \operatorname{Tr}\left(I_n - X(X^TX)^{-1}X^T\right) \\ &= \operatorname{Tr}(I_n) - \operatorname{Tr}\left( X(X^TX)^{-1}X^T \right) \\ &= \operatorname{Tr}(I_n) - \operatorname{Tr}\left( (X^TX)^{-1}X^TX \right) \\ &= \operatorname{Tr}(I_n) - \operatorname{Tr}(I_p) \\ &= n-p \end{align*}$$

Then $$\begin{align*} V = \frac{(n-p)s^2}{\sigma^2} = \frac{\hat\varepsilon^T\hat\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)^T M \left(\frac{\varepsilon}{\sigma}\right). \end{align*}$$
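The last equality uses $\hat\varepsilon = M\varepsilon$ together with the symmetry and idempotency of $M$: $$\hat\varepsilon^T\hat\varepsilon = \varepsilon^T M^T M \varepsilon = \varepsilon^T M \varepsilon \;,$$ and $\varepsilon/\sigma \sim \mathcal{N}(0, I_n)$ because $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.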

Applying the Theorem for the Distribution of an Idempotent Quadratic Form in a Standard Normal Vector (stated above), we know that $V \sim \chi^2_{n-p}$.

Because $\varepsilon$ is normally distributed, $\hat\beta$ and $\hat\varepsilon$ are jointly normal and uncorrelated, and therefore independent; since $s^2$ is a function of $\hat\varepsilon$ alone, $s^2$ is also independent of $\hat\beta$. Thus, $z_k$ and $V$ are independent of each other.
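Explicitly, both $\hat\beta - \beta = (X^TX)^{-1}X^T\varepsilon$ and $\hat\varepsilon = M\varepsilon$ are linear functions of the same normal vector $\varepsilon$, so they are jointly normal, and $$\operatorname{Cov}\left(\hat\beta, \hat\varepsilon\right) = (X^TX)^{-1}X^T \operatorname{Cov}(\varepsilon)\, M^T = \sigma^2 (X^TX)^{-1}X^T M = \sigma^2 (X^TX)^{-1}(MX)^T = \mathbf{0} \;,$$ since $MX = 0$; zero covariance plus joint normality gives independence.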

Then, $$\begin{align*} t_k = \frac{z_k}{\sqrt{V/(n-p)}} \end{align*}$$ is the ratio of a standard normal random variable to the square root of an independent chi-squared random variable divided by its degrees of freedom (here $n-p$), which is the defining characterization of the $t$ distribution. Therefore, the statistic $t_k$ has a $t$ distribution with $n-p$ degrees of freedom.

It can then be algebraically manipulated into a more familiar form.

$$\begin{align*} t_k &= \frac{\frac{\hat\beta_k -\beta_k}{\sqrt{\sigma^2 S_{kk}}}}{\sqrt{\frac{(n-p)s^2}{\sigma^2}/(n-p)}} \\ &= \frac{\frac{\hat\beta_k -\beta_k}{\sqrt{S_{kk}}}}{\sqrt{s^2}} = \frac{\hat\beta_k -\beta_k}{\sqrt{s^2 S_{kk}}} \\ &= \frac{\hat\beta_k -\beta_k}{\operatorname{se}\left(\hat\beta_k \right)} \end{align*}$$
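As an empirical illustration (a simulation sketch with dimensions and names of my own choosing, not part of the proof), the standardized coefficient estimates do line up with the $t_{n-p}$ quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p = 30, 4                                   # small n so t_{n-p} and N(0,1) differ visibly
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design
beta, sigma = np.array([1.0, -2.0, 0.5, 3.0]), 2.0
C = np.linalg.inv(X.T @ X)

t_stats = []
for _ in range(20_000):                        # redraw only the error vector each time
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = C @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    t_stats.append((beta_hat[1] - beta[1]) / np.sqrt(s2 * C[1, 1]))

qs = [0.025, 0.5, 0.975]
print(np.quantile(t_stats, qs))                # empirical quantiles
print(stats.t.ppf(qs, df=n - p))               # theoretical t_{n-p} quantiles
```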

Blue Marker
  • Also a side question: for the `Theorem for the Distribution of an Idempotent Quadratic Form in a Standard Normal Vector`, don't we also need $A$ to be symmetric? Unfortunately, I don't have Greene, so I can't see the proof although I saw that [Wikipedia had the same form as you](https://en.wikipedia.org/wiki/Chi-squared_distribution#Relation_to_other_distributions). However, a counter example seems to be the idempotent matrix $A=\begin{pmatrix}1&1\\0&0\end{pmatrix}$ which leads to $x_1^2+x_1 x_2$ which is not Chi-Squared since it could take on negative values... – Garrett Oct 01 '14 at 09:17
  • @Garrett My apologies, $A$ should be both symmetric and idempotent. A proof is provided as Theorem 3 in this document: http://www2.econ.iastate.edu/classes/econ671/hallam/documents/QUAD_NORM.pdf Luckily, $M$ is symmetric as well as idempotent. – Blue Marker Oct 01 '14 at 13:29
  • 2
    $A$ is merely *a* matrix representation of a quadratic form. Every quadratic form has a symmetric representation, so the requirement of symmetry of $A$ is implicit in the statement of the theorem. (People do not use asymmetric matrices to represent quadratic forms.) Thus the quadratic form $(x_1,x_2)\to x_1^2+x_1x_2$ is uniquely represented by the matrix $A=\begin{pmatrix}1&1/2\\1/2&0\end{pmatrix}$ which is *not* idempotent. – whuber Oct 01 '14 at 15:33
  • This is a great explanation for nonrandom $x$. Can you explain the derivation if $x$ is assumed to be random? – denizen of the north Feb 13 '18 at 14:07
  • 3
    Why does $\epsilon\sim N(0,\sigma^2)$ imply $\hat{\beta}$ is independent of $\hat{\epsilon}$? Not quite following there. – Glassjawed Oct 25 '18 at 15:59
  • 3
    @Glassjawed As both $\hat{\beta}$ and $\hat{\varepsilon}$ are multivariate normally distributed, then uncorrelatedness implies independence. Using expressions $\hat{\beta} = \beta + \left(X^{\top}X\right)^{-1}X^{\top}\varepsilon$ and $\hat{\varepsilon} = M\varepsilon$ from above, we can show that $\operatorname{Cov}\left(\hat{\beta}, \hat{\varepsilon}\right) = \mathbf{0}_{p\times n}$. – rzch May 01 '19 at 01:11
  • Why is this line: \begin{align*} V = \frac{(n-p)s^2}{\sigma^2} = \frac{\hat\varepsilon^T\hat\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)^T M \left(\frac{\varepsilon}{\sigma}\right) \end{align*} and not: \begin{align*} V = \frac{(n-p)s^2}{\sigma^2} = \frac{\hat\varepsilon^T\hat\varepsilon}{\sigma^2} = \left(\frac{y}{\sigma}\right)^T M \left(\frac{y}{\sigma}\right) \end{align*} given that $My=\hat\varepsilon$? – JDoe2 Sep 19 '20 at 23:56
  • A text I went through argues on p. 21 that the coefficients are multivariate normally distributed rather than t-distributed. What is causing the discrepancy? http://home.cc.umanitoba.ca/~godwinrt/4042/material/part3.pdf – WetlabStudent Feb 18 '21 at 04:24