
I am studying parameters generated by the method of Ordinary Least Squares, in particular, a parameter's associated standard error. Wikipedia suggests calculating

$$s^2=\frac{S\left(\hat\beta\right)}{n-p}=\frac{\left\lVert y-X\hat\beta\right\rVert^2}{n-p}$$

describing this as

the standard error of the regression (SER), or standard error of the equation (SEE).

This, however, is not sufficient to calculate the standard error for a given parameter.

I found this, which suggests

For each parameter, the standard error can be obtained by:

$$s_{\hat \beta _j}=s \sqrt{C_{jj}}$$

where $C_{jj}$ is the $j^{\text{th}}$ diagonal element of $(X'X)^{-1}$ (note that $(X'WX)^{-1}$ is used for weighted fitting).

Note that using this technique yields precisely the same standard errors as those reported by the lm function in R.
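
For reference, here is a minimal R sketch of that claim (the dataset and formula are arbitrary, just R's built-in mtcars for illustration); the hand-computed values should match what summary() reports for the lm fit:

```r
# Illustration with R's built-in mtcars data (any model would do).
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)                     # n x p design matrix, first column is all ones
y <- mtcars$mpg
n <- nrow(X); p <- ncol(X)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # OLS coefficients
s <- sqrt(sum((y - X %*% beta_hat)^2) / (n - p))   # standard error of the regression
C <- solve(t(X) %*% X)                     # the matrix whose diagonal entries are C_jj

s * sqrt(diag(C))                          # hand-computed standard errors
summary(fit)$coefficients[, "Std. Error"]  # the same numbers from lm
```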

Where does this $C$ come from? I have been reading about this for days, and this is the first reference I have seen to this matrix.

Joshua Cook
  • See http://stats.stackexchange.com/questions/135201, http://stats.stackexchange.com/questions/104704, and http://stats.stackexchange.com/questions/180478 – whuber Jun 05 '16 at 02:34

1 Answer


In matrix notation we have data $\left (\mathbf y, \mathbf X\right)$ and we consider the model

$$\mathbf y = \mathbf X\beta + \mathbf u$$

where for the moment we only assume that the regressor matrix contains a column of ones (an intercept), so that we can safely assume that the "error term" $\mathbf u$ has zero mean. We do not as yet make any statistical/probabilistic assumptions. Calculating the unknown betas by Ordinary Least Squares is a mathematical approximation method that requires no statistical assumptions. We obtain

$$\hat \beta = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf y$$

This is the (orthogonal) Linear Projection coefficient vector, and, as a mathematical approximation story, it stops here.
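
As a quick numerical check of this formula (a sketch, again using R's built-in mtcars data with an arbitrary regression), the projection coefficients agree with what lm() returns:

```r
# Compute the linear projection coefficients directly.
X <- model.matrix(~ wt + hp, data = mtcars)   # includes the column of ones
y <- mtcars$mpg

beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)

beta_hat                                      # (X'X)^{-1} X'y
coef(lm(mpg ~ wt + hp, data = mtcars))        # same values from lm
```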

Now we want to talk about the "standard error" of the estimates. But that is a statistical concept, so we must assume something random and probabilistic. Assume that the regressors are all deterministic, but that $\mathbf u$ is a random vector. Because the regressor matrix contains a column of ones, we can take $E(\mathbf u) = \mathbf 0$, where $E$ denotes the expected value, without loss of generality (any nonzero mean would be absorbed into the intercept). We have

$$\hat \beta = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf y = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\left(\mathbf X\beta + \mathbf u\right) = \beta +\left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf u$$

$$\implies \hat \beta -\beta = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf u$$

Since $\beta$ is a constant, we have $\text{Var}(\hat \beta -\beta) = \text{Var}(\hat \beta)$, where the variance is the square of the standard deviation. The multivariate version of the variance (the variance-covariance matrix) is

$$\text{Var}(\hat \beta -\beta) = E\Big[(\hat \beta -\beta)(\hat \beta -\beta)'\Big] - E\Big[(\hat \beta -\beta)\Big]E\Big[(\hat \beta -\beta)\Big]'$$

where the prime denotes the transpose. Substituting,

$$\text{Var}(\hat \beta) = E\Big[\left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf u\mathbf u'\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1} \Big] - E\Big[\left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf u\Big]E\Big[\left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf u\Big]'$$

Since we have assumed that $\mathbf X$ is deterministic, the expected value applies only to $\mathbf u$ so we have

$$\text{Var}(\hat \beta) = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'E(\mathbf u\mathbf u')\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1} - \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'E(\mathbf u)E(\mathbf u)'\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1} $$

Since $E(\mathbf u) =\mathbf 0$ we are left with

$$\text{Var}(\hat \beta) = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'E(\mathbf u\mathbf u')\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1} $$

Now comes another benchmark statistical assumption: the error vector $\mathbf u$ is "homoskedastic" (with uncorrelated elements), which means

$$\text{Var}(\mathbf u) = E(\mathbf u\mathbf u') = \sigma^2I$$

where $\sigma^2 >0$ is the common variance of each element of the error vector, and $I$ is the identity matrix. Substituting we get

$$\text{Var}(\hat \beta) = \left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\sigma^2I\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1} =\sigma^2\left(\mathbf X' \mathbf X\right) ^{-1} \mathbf X'\mathbf X\left(\mathbf X' \mathbf X\right) ^{-1}$$

$$\text{Var}(\hat \beta) = \sigma^2\left(\mathbf X' \mathbf X\right) ^{-1}$$

The $\sigma^2$ is estimated by $s^2$ as in the OP's question, and the diagonal elements of $\left(\mathbf X' \mathbf X\right) ^{-1}$, each multiplied by $s^2$, give the estimated variances of the corresponding elements of the estimated beta vector. Taking the square root of each gives the standard error of the corresponding element. The off-diagonal elements of the matrix, multiplied by $s^2$, give the estimated covariances between the elements of the beta vector.
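
To tie this back to the question, here is a minimal R sketch (again with the built-in mtcars data, purely for illustration) of the estimated variance-covariance matrix $s^2\left(\mathbf X' \mathbf X\right)^{-1}$ and the standard errors obtained from its diagonal:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
y <- mtcars$mpg
n <- nrow(X); p <- ncol(X)

s2   <- sum(residuals(fit)^2) / (n - p)       # estimate of sigma^2
Vhat <- s2 * solve(t(X) %*% X)                # estimated Var(beta_hat)

Vhat                                          # matches vcov(fit)
sqrt(diag(Vhat))                              # matches the standard errors in summary(fit)
```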

Alecos Papadopoulos