For my own understanding, I am interested in manually replicating the calculation of the standard errors of estimated coefficients as they come, for example, with the output of R's lm() function, but I haven't been able to pin it down. What is the formula / implementation used?

-
Good question. Many people know regression from the linear algebra point of view, where you solve the normal equations $X'X\beta=X'y$ and get the answer for $\beta$; it is less clear why we have a standard error, and what assumption lies behind it. – Haitao Du Jul 19 '16 at 13:42
3 Answers
The linear model is written as $$ \left| \begin{array}{l} \mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} \\ \mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I}), \end{array} \right.$$ where $\mathbf{y}$ denotes the vector of responses, $\mathbf{\beta}$ is the vector of fixed effects parameters, $\mathbf{X}$ is the corresponding design matrix whose columns are the values of the explanatory variables, and $\mathbf{\epsilon}$ is the vector of random errors.
It is well known that an estimate of $\mathbf{\beta}$ is given by (refer, e.g., to the wikipedia article) $$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$$ Hence $$ \textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1} (\mathbf{X}^{\prime} \mathbf{X}) (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}, $$ [reminder: $\textrm{Var}(AX)=A\times \textrm{Var}(X) \times A′$, for some random vector $X$ and some non-random matrix $A$]
so that $$ \widehat{\textrm{Var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}, $$ where $\hat{\sigma}^2$ can be obtained by the Mean Square Error (MSE) in the ANOVA table.
Example with a simple linear regression in R
#------generate one data set with epsilon ~ N(0, 0.25)------
seed <- 1152 #seed
n <- 100 #nb of observations
a <- 5 #intercept
b <- 2.7 #slope
set.seed(seed)
epsilon <- rnorm(n, mean=0, sd=sqrt(0.25))
x <- sample(x=c(0, 1), size=n, replace=TRUE)
y <- a + b * x + epsilon
#-----------------------------------------------------------
#------using lm------
mod <- lm(y ~ x)
#--------------------
#------using the explicit formulas------
X <- cbind(1, x)
betaHat <- solve(t(X) %*% X) %*% t(X) %*% y            # (X'X)^{-1} X'y
var_betaHat <- anova(mod)[[3]][2] * solve(t(X) %*% X)  # MSE * (X'X)^{-1}
#---------------------------------------
#------comparison------
#estimate
> mod$coef
(Intercept) x
5.020261 2.755577
> c(betaHat[1], betaHat[2])
[1] 5.020261 2.755577
#standard error
> summary(mod)$coefficients[, 2]
(Intercept) x
0.06596021 0.09725302
> sqrt(diag(var_betaHat))
x
0.06596021 0.09725302
#----------------------
When there is a single explanatory variable, the model reduces to $$y_i = a + bx_i + \epsilon_i, \qquad i = 1, \dotsc, n$$ and $$\mathbf{X} = \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right), \qquad \mathbf{\beta} = \left( \begin{array}{c} a\\b \end{array} \right)$$ so that $$(\mathbf{X}^{\prime} \mathbf{X})^{-1} = \frac{1}{n\sum x_i^2 - (\sum x_i)^2} \left( \begin{array}{cc} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{array} \right)$$ and the formulas become more transparent. For example, the standard error of the estimated slope is $$\sqrt{\widehat{\textrm{Var}}(\hat{b})} = \sqrt{[\hat{\sigma}^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}]_{22}} = \sqrt{\frac{n \hat{\sigma}^2}{n\sum x_i^2 - (\sum x_i)^2}}.$$
> num <- n * anova(mod)[[3]][2]
> denom <- n * sum(x^2) - sum(x)^2
> sqrt(num / denom)
[1] 0.09725302
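As an aside, `anova(mod)[[3]][2]` is just one way to extract $\hat{\sigma}^2$. Since $\hat{\sigma}^2 = \textrm{RSS}/(n-p)$, the following standard R accessors (a quick sketch reusing the `mod` object fitted above) all return the same value:
> anova(mod)[[3]][2]                        # residual mean square from the ANOVA table
> sum(residuals(mod)^2) / df.residual(mod)  # RSS / (n - p)
> summary(mod)$sigma^2                      # residual standard error, squared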
-
Thanks for the thorough answer. So, I take it the last formula doesn't hold in the multivariate case? – ako Dec 01 '12 at 18:18
-
No, the very last formula only works for the specific X matrix of the simple linear model. In the multivariate case, you have to use the general formula given above. – ocram Dec 02 '12 at 07:21
-
@loganecolss: It comes from the fact that $\text{Var}(AX)=A\,\text{Var}(X)\,A'$, for some random vector $X$ and some non-random matrix $A$. – ocram Feb 09 '14 at 09:38
-
note that these are the right answers for hand calculation, but the actual implementation used within `lm.fit`/`summary.lm` is a bit different, for stability and efficiency (see the QR-based sketch after these comments) ... – Ben Bolker Nov 08 '15 at 19:51
-
This video derives the mean and variance too: https://www.youtube.com/watch?v=jyBtfhQsf44 – StatsStudent Apr 07 '16 at 23:07
-
To be clear, `anova(mod)[[3]][2]` only works for a model with one factor, right? The "3" picks out the "Mean Sq", and the "2" picks out the "Residuals" line? Also, is that the variance of the residuals? It looks a bit different. What exactly is it? – dfrankow Mar 16 '20 at 21:27
-
Is another way to think of that sigma^2 as the "sum of squared errors of the model divided by the degrees of freedom"? – dfrankow Mar 16 '20 at 22:21
-
Hi, when calculating the confidence interval of the model coefficients, do I need to divide $\sqrt{\hat{Var}(\hat{b})}$ by $\sqrt{n}$ to get the standard error before multiplying by the critical value? – Stephen Ge Apr 02 '21 at 00:52
-
+1. I have learned regression from a Bayesian perspective, where I assume Gaussian priors on $\beta$ and the posterior on $\beta$ naturally gives the required variance. Just to clarify my understanding, I assume the above analysis is from the frequentist viewpoint? – swag2198 Aug 10 '21 at 04:07
-
@swag2198: yes, parameter uncertainty is readily available from the posterior in the Bayesian analysis. – ocram Aug 10 '21 at 05:30
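To illustrate Ben Bolker's comment above, here is a minimal sketch of the QR-based route (the actual compiled code inside `lm.fit` differs in detail; this only illustrates the idea, reusing `X`, `y` and `n` from the example above):
QR <- qr(X)                                          # X = QR, so X'X = R'R
betaQR <- qr.coef(QR, y)                             # coefficients without forming X'X
sigma2Hat <- sum(qr.resid(QR, y)^2) / (n - ncol(X))  # RSS / (n - p)
XtXinv <- chol2inv(qr.R(QR))                         # (X'X)^{-1} from the triangular factor
sqrt(sigma2Hat * diag(XtXinv))                       # same standard errors as above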
The formulae for these can be found in any intermediate text on statistics; in particular, you can find them in Sheather (2009, Chapter 5), from which the following exercise is also taken (page 138).
The following R code computes the coefficient estimates and their standard errors manually
dfData <- as.data.frame(
read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv",
header=T))
# using direct calculations
vY <- as.matrix(dfData[, -2])[, 5] # dependent variable
mX <- cbind(constant = 1, as.matrix(dfData[, -2])[, -5]) # design matrix
vBeta <- solve(t(mX)%*%mX, t(mX)%*%vY) # coefficient estimates
dSigmaSq <- sum((vY - mX%*%vBeta)^2)/(nrow(mX)-ncol(mX)) # estimate of sigma-squared
mVarCovar <- dSigmaSq*chol2inv(chol(t(mX)%*%mX)) # variance covariance matrix
vStdErr <- sqrt(diag(mVarCovar)) # coeff. est. standard errors
print(cbind(vBeta, vStdErr)) # output
which produces the output
vStdErr
constant -57.6003854 9.2336793
InMichelin 1.9931416 2.6357441
Food 0.2006282 0.6682711
Decor 2.2048571 0.3929987
Service 3.0597698 0.5705031
Compare to the output from lm():
# using lm()
names(dfData)
summary(lm(Price ~ InMichelin + Food + Decor + Service, data = dfData))
which produces the output:
Call:
lm(formula = Price ~ InMichelin + Food + Decor + Service, data = dfData)
Residuals:
Min 1Q Median 3Q Max
-20.898 -5.835 -0.755 3.457 105.785
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.6004 9.2337 -6.238 3.84e-09 ***
InMichelin 1.9931 2.6357 0.756 0.451
Food 0.2006 0.6683 0.300 0.764
Decor 2.2049 0.3930 5.610 8.76e-08 ***
Service 3.0598 0.5705 5.363 2.84e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.55 on 159 degrees of freedom
Multiple R-squared: 0.6344, Adjusted R-squared: 0.6252
F-statistic: 68.98 on 4 and 159 DF, p-value: < 2.2e-16
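As a quick cross-check of the manual computation, R also exposes the estimated variance-covariance matrix $\hat{\sigma}^2 (\mathbf{X}^{\prime}\mathbf{X})^{-1}$ directly through vcov(). A short sketch (the object name fit is chosen just for this illustration):
fit <- lm(Price ~ InMichelin + Food + Decor + Service, data = dfData)
all.equal(unname(vcov(fit)), unname(mVarCovar))  # TRUE: same matrix as computed by hand
sqrt(diag(vcov(fit)))                            # reproduces the Std. Error column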

-
Nice trick with the `solve()` function. This would be quite a bit longer without the matrix algebra. Is there a succinct way of performing that specific line with just basic operators? – ako Dec 01 '12 at 18:57
-
@AkselO There is the well-known closed-form expression for the OLS estimator, $\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{Y}$, which you can compute by explicitly computing the inverse of the $(\mathbf{X}'\mathbf{X})$ matrix (as @ocram has done), but this gets tricky with ill-conditioned matrices. – tchakravarty Dec 01 '12 at 19:07
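Expanding on the ill-conditioning point: forming $\mathbf{X}'\mathbf{X}$ squares the condition number of the problem, so a least-squares solve via the QR decomposition is numerically safer. A short sketch, reusing mX and vY from the answer above:
vBetaQR <- qr.solve(mX, vY)        # least-squares solution via QR, X'X never formed
all.equal(c(vBeta), c(vBetaQR))    # agrees with the normal-equations solution here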
Part of Ocram's answer is wrong. Actually:
$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y} - (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}.$
$E(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$
And the comments on the first answer show that more explanation of the variance of the coefficients is needed:
$\textrm{Var}(\hat{\mathbf{\beta}}) = E\left[(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}))(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}))^{\prime}\right]=\textrm{Var}(- (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}) =(\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
Edit
Thanks, I $\mathbf{wrongly}$ ignored the hat on that beta. The deduction above is $\mathbf{wrong}$. The correct result is:
1. $\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$ (To get this equation, set the first-order derivative of the $\mathrm{SSR}$ with respect to $\mathbf{\beta}$ equal to zero; this minimizes the $\mathrm{SSR}$.)
2.$E(\hat{\mathbf{\beta}}|\mathbf{X}) = E((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} (\mathbf{X}\mathbf{\beta}+\mathbf{\epsilon})|\mathbf{X}) = \mathbf{\beta} + ((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime})E(\mathbf{\epsilon}|\mathbf{X}) = \mathbf{\beta}.$
3. $\textrm{Var}(\hat{\mathbf{\beta}}|\mathbf{X}) = E\left[(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}|\mathbf{X}))(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}|\mathbf{X}))^{\prime}\,\middle|\,\mathbf{X}\right]=\textrm{Var}((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}|\mathbf{X}) =(\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
Hopefully it helps.
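One way to convince oneself of result 3 is a small Monte Carlo check: keep the design matrix fixed, simulate many response vectors, and compare the empirical covariance of $\hat{\mathbf{\beta}}$ with $\sigma^2 (\mathbf{X}^{\prime}\mathbf{X})^{-1}$. A sketch with arbitrary illustrative values:
set.seed(42)
n <- 50; sigma <- 0.5
X <- cbind(1, runif(n))                  # fixed design matrix
beta <- c(1, 2)                          # true coefficients
betaHats <- replicate(10000, {
  y <- X %*% beta + rnorm(n, sd = sigma) # fresh errors each replication
  c(solve(t(X) %*% X, t(X) %*% y))       # OLS estimate as a plain vector
})
cov(t(betaHats))                         # empirical covariance of betaHat
sigma^2 * solve(t(X) %*% X)              # theoretical sigma^2 (X'X)^{-1}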

-
The derivation of the OLS estimator for the beta vector, $\hat{\boldsymbol \beta} = ({\bf X'X})^{-1}{\bf X'Y}$, is found in any decent regression textbook. In light of that, can you provide a proof that it should be $\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y} - (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}$ instead? – gung - Reinstate Monica Apr 06 '16 at 03:40
-
Your $\hat\beta$ is not even an estimator, because $\epsilon$ is not observable! – whuber Apr 06 '16 at 14:55
-
This can also be viewed in this video: https://www.youtube.com/watch?v=jyBtfhQsf44 – StatsStudent Apr 07 '16 at 23:06