I'm learning linear regression from "The Elements of Statistical Learning" (Chapter 3, "Linear Methods for Regression").

The book states that the variance-covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by
$$ \mathrm{Var}(\hat{\beta}) = (X^TX)^{-1}\sigma^2. $$
Typically one estimates the variance $\sigma^2$ by
$$ \hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N(y_i-\hat{y}_i)^2. $$

Could someone explain the above formulas in detail, along the lines of the following derivation of the expectation?
$$ E(\hat{\beta})=E((X^TX)^{-1}X^Ty) \\ =(X^TX)^{-1}X^TE(y) \\ =(X^TX)^{-1}X^TX\beta \\ =\beta $$

    If "p" is the number of regressors _including_ the constant term usually to be found in a regression setup, then this formula is wrong. The last formula is correct only if we assume that the regressors are deterministic. Exactly what is contained in this book you study? Just "cookbook recipes"? – Alecos Papadopoulos Jul 29 '14 at 15:15
  • $p$ is the number of features; $\beta$ is the parameter vector to be optimized. – irwenqiang Jul 31 '14 at 03:46
  • Equation 3.6 is $\hat \beta = (X'X)^{-1}X'y$ – dimitriy Oct 22 '14 at 00:47
  • I think the answer is explained more clearly in this document: https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf – Jiashun Xiao Sep 08 '18 at 08:44

2 Answers


The other answer here, and the answers on a later version of this question ("Covariance matrix of least squares estimator $\hat{\beta}$"), are not correct.

In the book you are referencing, the data $x_1,\dots,x_N$ ($x_i^{\top}$ is the $i$th row of $\mathbf{X}$) are not random. The authors say that the $y_i$ are uncorrelated with constant variance, and we have the formula $$ \hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}. $$ That's really all they say. There is no assumption that the true distribution of $Y$ is a linear function of $X$ plus noise, and there is no explicit assumption that $\mathbb{E}(\mathbf{y}) = \mathbf{X}\beta$. So, if you work only with the information you are actually given in the book, you'll do something like this:

First we compute the expectation: $$ \mathbb{E}(\hat{\beta}) = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y}). $$ So \begin{align} \mathbb{E}(\hat{\beta})\mathbb{E}(\hat{\beta})^{\top} &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y}) \Bigl((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y})\Bigr)^{\top} \\ &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y}) \mathbb{E}(\mathbf{y})^{\top} \mathbf{X}\bigl((\mathbf{X}^{\top}\mathbf{X})^{-1}\bigr)^{\top} \\ &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y}) \mathbb{E}(\mathbf{y})^{\top} \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}, \end{align} where the last step uses the symmetry of $(\mathbf{X}^{\top}\mathbf{X})^{-1}$. And \begin{align} \mathbb{E}(\hat{\beta}\hat{\beta}^{\top}) &= \mathbb{E}\biggl((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\Bigl( (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\Bigr)^{\top} \biggr) \\ &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbb{E}(\mathbf{y} \mathbf{y}^{\top}) \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}. \end{align} The variance-covariance matrix is the difference, as usual: \begin{align} \mathrm{Var}(\hat{\beta}) &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\bigl(\mathbb{E}(\mathbf{y} \mathbf{y}^{\top}) - \mathbb{E}(\mathbf{y}) \mathbb{E}(\mathbf{y})^{\top} \bigr) \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1} \\ &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\bigl(\sigma^2 I_{N\times N} \bigr) \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1} \\ &= (\mathbf{X}^{\top}\mathbf{X})^{-1}\sigma^2. \end{align}

So the only assumption we had is used explicitly at the end: the variance-covariance matrix of $\mathbf{y}$ is just $\sigma^2$ multiplied by the identity matrix.

– T_M

Because the $x_i$ are fixed, $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is a linear transformation of $\mathbf{y}$, so

$$\mathrm{Var}[\hat{\beta}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathrm{Var}[\mathbf{y}]\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}$$

And

$$\mathrm{Var}[\mathbf{y}] = \mathrm{Cov}[\mathbf{y}] = \left[\begin{array}{ccc} \sigma_{11} & \cdots & \sigma_{1N} \\ \vdots & \ddots & \vdots \\ \sigma_{N1} & \cdots & \sigma_{NN} \end{array}\right] = \sigma^2 \mathbf{I} $$

because the $y_i$ are uncorrelated and have constant variance $\sigma^2$:

$$\sigma_{ij}=\mathrm{Cov}[y_i, y_j]=E[y_iy_j]-E[y_i]E[y_j]=\sigma^2\delta_{ij}$$

Therefore,

$$\mathrm{Var}[\hat{\beta}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2$$