For my own understanding, I am interested in manually replicating the calculation of the standard errors of estimated coefficients as they come, for example, with the output of R's lm() function, but I haven't been able to pin it down. What is the formula / implementation used?

-
Good question. Many people know regression from the linear algebra point of view, where you solve the normal equations $X'X\beta=X'y$ and get the answer for $\beta$; it is less clear why we have a standard error, and what assumption lies behind it. – Haitao Du Jul 19 '16 at 13:42
3 Answers
The linear model is written as $$ \left| \begin{array}{l} \mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} \\ \mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I}), \end{array} \right.$$ where $\mathbf{y}$ denotes the vector of responses, $\mathbf{\beta}$ is the vector of fixed effects parameters, $\mathbf{X}$ is the corresponding design matrix whose columns are the values of the explanatory variables, and $\mathbf{\epsilon}$ is the vector of random errors.
It is well known that an estimate of $\mathbf{\beta}$ is given by (refer, e.g., to the wikipedia article) $$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$$ Hence $$ \textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1} (\mathbf{X}^{\prime} \mathbf{X}) (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}, $$ [reminder: $\textrm{Var}(AX)=A\times \textrm{Var}(X) \times A′$, for some random vector $X$ and some non-random matrix $A$]
so that $$ \widehat{\textrm{Var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}, $$ where $\hat{\sigma}^2$ can be obtained by the Mean Square Error (MSE) in the ANOVA table.
Example with a simple linear regression in R
#------generate one data set with epsilon ~ N(0, 0.25)------
seed <- 1152 #seed
n <- 100 #nb of observations
a <- 5 #intercept
b <- 2.7 #slope
set.seed(seed)
epsilon <- rnorm(n, mean=0, sd=sqrt(0.25))
x <- sample(x=c(0, 1), size=n, replace=TRUE)
y <- a + b * x + epsilon
#-----------------------------------------------------------
#------using lm------
mod <- lm(y ~ x)
#--------------------
#------using the explicit formulas------
X <- cbind(1, x)
betaHat <- solve(t(X) %*% X) %*% t(X) %*% y            # (X'X)^{-1} X'y
var_betaHat <- anova(mod)[[3]][2] * solve(t(X) %*% X)  # MSE * (X'X)^{-1}
#---------------------------------------
#------comparison------
#estimate
> mod$coef
(Intercept) x
5.020261 2.755577
> c(betaHat[1], betaHat[2])
[1] 5.020261 2.755577
#standard error
> summary(mod)$coefficients[, 2]
(Intercept) x
0.06596021 0.09725302
> sqrt(diag(var_betaHat))
x
0.06596021 0.09725302
#----------------------
When there is a single explanatory variable, the model reduces to $$y_i = a + bx_i + \epsilon_i, \qquad i = 1, \dotsc, n$$ and $$\mathbf{X} = \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right), \qquad \mathbf{\beta} = \left( \begin{array}{c} a\\b \end{array} \right)$$ so that $$(\mathbf{X}^{\prime} \mathbf{X})^{-1} = \frac{1}{n\sum x_i^2 - (\sum x_i)^2} \left( \begin{array}{cc} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{array} \right)$$ and the formulas become more transparent. For example, the standard error of the estimated slope is $$\sqrt{\widehat{\textrm{Var}}(\hat{b})} = \sqrt{[\hat{\sigma}^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}]_{22}} = \sqrt{\frac{n \hat{\sigma}^2}{n\sum x_i^2 - (\sum x_i)^2}}.$$
> num <- n * anova(mod)[[3]][2]
> denom <- n * sum(x^2) - sum(x)^2
> sqrt(num / denom)
[1] 0.09725302
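As an aside, `anova(mod)[[3]][2]` is just one way to extract $\hat{\sigma}^2$. Since $\hat{\sigma}^2 = \textrm{RSS}/(n-p)$, the following standard R accessors (a quick sketch reusing the `mod` object fitted above) all return the same value:
> anova(mod)[[3]][2]                        # residual mean square from the ANOVA table
> sum(residuals(mod)^2) / df.residual(mod)  # RSS / (n - p)
> summary(mod)$sigma^2                      # residual standard error, squared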
-
Thanks for the thorough answer. So, I take it the last formula doesn't hold in the multivariate case? – ako Dec 01 '12 at 18:18
-
No, the very last formula only works for the specific X matrix of the simple linear model. In the multivariate case, you have to use the general formula given above. – ocram Dec 02 '12 at 07:21
-
@loganecolss: It comes from the fact that $\text{Var}(AX)=A\,\text{Var}(X)\,A'$, for some random vector $X$ and some non-random matrix $A$. – ocram Feb 09 '14 at 09:38
-
note that these are the right answers for hand calculation, but the actual implementation used within `lm.fit`/`summary.lm` is a bit different, for stability and efficiency (see the QR-based sketch after these comments) ... – Ben Bolker Nov 08 '15 at 19:51
-
This video derives the mean and variance too: https://www.youtube.com/watch?v=jyBtfhQsf44 – StatsStudent Apr 07 '16 at 23:07
-
To be clear, `anova(mod)[[3]][2]` only works for a model with one factor, right? The "3" picks out the "Mean Sq", and the "2" picks out the "Residuals" line? Also, is that the variance of the residuals? It looks a bit different. What exactly is it? – dfrankow Mar 16 '20 at 21:27
-
Is another way to think of that sigma^2 as the "sum of squared errors of the model divided by the degrees of freedom"? – dfrankow Mar 16 '20 at 22:21
-
Hi, when calculating the confidence interval of the model coefficients, do I need to divide $\sqrt{\hat{Var}(\hat{b})}$ by $\sqrt{n}$ to get the standard error before multiplying by the critical value? – Stephen Ge Apr 02 '21 at 00:52
-
+1. I have learned regression from a Bayesian perspective, where I assume Gaussian priors on $\beta$ and the posterior on $\beta$ naturally gives the required variance. Just to clarify my understanding, I assume the above analysis is from the frequentist viewpoint? – swag2198 Aug 10 '21 at 04:07
-
@swag2198: yes, parameter uncertainty is readily available from the posterior in the Bayesian analysis. – ocram Aug 10 '21 at 05:30
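To illustrate Ben Bolker's comment above, here is a minimal sketch of the QR-based route (the actual compiled code inside `lm.fit` differs in detail; this only illustrates the idea, reusing `X`, `y` and `n` from the example above):
QR <- qr(X)                                          # X = QR, so X'X = R'R
betaQR <- qr.coef(QR, y)                             # coefficients without forming X'X
sigma2Hat <- sum(qr.resid(QR, y)^2) / (n - ncol(X))  # RSS / (n - p)
XtXinv <- chol2inv(qr.R(QR))                         # (X'X)^{-1} from the triangular factor
sqrt(sigma2Hat * diag(XtXinv))                       # same standard errors as above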
The formulae for these can be found in any intermediate text on statistics; in particular, you can find them in Sheather (2009, Chapter 5), from which the following exercise is also taken (page 138).
The following R code computes the coefficient estimates and their standard errors manually
dfData <- as.data.frame(
read.csv("http://www.stat.tamu.edu/~sheather/book/docs/datasets/MichelinNY.csv",
header=T))
# using direct calculations
vY <- as.matrix(dfData[, -2])[, 5] # dependent variable
mX <- cbind(constant = 1, as.matrix(dfData[, -2])[, -5]) # design matrix
vBeta <- solve(t(mX)%*%mX, t(mX)%*%vY) # coefficient estimates
dSigmaSq <- sum((vY - mX%*%vBeta)^2)/(nrow(mX)-ncol(mX)) # estimate of sigma-squared
mVarCovar <- dSigmaSq*chol2inv(chol(t(mX)%*%mX)) # variance covariance matrix
vStdErr <- sqrt(diag(mVarCovar)) # coeff. est. standard errors
print(cbind(vBeta, vStdErr)) # output
which produces the output
vStdErr
constant -57.6003854 9.2336793
InMichelin 1.9931416 2.6357441
Food 0.2006282 0.6682711
Decor 2.2048571 0.3929987
Service 3.0597698 0.5705031
Compare to the output from lm():
# using lm()
names(dfData)
summary(lm(Price ~ InMichelin + Food + Decor + Service, data = dfData))
which produces the output:
Call:
lm(formula = Price ~ InMichelin + Food + Decor + Service, data = dfData)
Residuals:
Min 1Q Median 3Q Max
-20.898 -5.835 -0.755 3.457 105.785
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.6004 9.2337 -6.238 3.84e-09 ***
InMichelin 1.9931 2.6357 0.756 0.451
Food 0.2006 0.6683 0.300 0.764
Decor 2.2049 0.3930 5.610 8.76e-08 ***
Service 3.0598 0.5705 5.363 2.84e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.55 on 159 degrees of freedom
Multiple R-squared: 0.6344, Adjusted R-squared: 0.6252
F-statistic: 68.98 on 4 and 159 DF, p-value: < 2.2e-16
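As a quick cross-check of the manual computation, R also exposes the estimated variance-covariance matrix $\hat{\sigma}^2 (\mathbf{X}^{\prime}\mathbf{X})^{-1}$ directly through vcov(). A short sketch (the object name fit is chosen just for this illustration):
fit <- lm(Price ~ InMichelin + Food + Decor + Service, data = dfData)
all.equal(unname(vcov(fit)), unname(mVarCovar))  # TRUE: same matrix as computed by hand
sqrt(diag(vcov(fit)))                            # reproduces the Std. Error column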

-
Nice trick with the `solve()` function. This would be quite a bit longer without the matrix algebra. Is there a succinct way of performing that specific line with just basic operators? – ako Dec 01 '12 at 18:57
-
@AkselO There is the well-known closed-form expression for the OLS estimator, $\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{Y}$, which you can compute by explicitly computing the inverse of the $(\mathbf{X}'\mathbf{X})$ matrix (as @ocram has done), but this gets tricky with ill-conditioned matrices. – tchakravarty Dec 01 '12 at 19:07
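Expanding on the ill-conditioning point: forming $\mathbf{X}'\mathbf{X}$ squares the condition number of the problem, so a least-squares solve via the QR decomposition is numerically safer. A short sketch, reusing mX and vY from the answer above:
vBetaQR <- qr.solve(mX, vY)        # least-squares solution via QR, X'X never formed
all.equal(c(vBeta), c(vBetaQR))    # agrees with the normal-equations solution here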
Part of Ocram's answer is wrong. Actually:
$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y} - (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}.$
$E(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$
And the comments on the first answer show that more explanation of the variance of the coefficients is needed:
$\textrm{Var}(\hat{\mathbf{\beta}}) = E\left[(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}))(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}))^{\prime}\right]=\textrm{Var}(- (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}) =(\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
Edit
Thanks, I $\mathbf{wrongly}$ ignored the hat on that beta. The deduction above is $\mathbf{wrong}$. The correct result is:
1. $\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}.$ (To get this equation, set the first-order derivative of the $\mathrm{SSR}$ with respect to $\mathbf{\beta}$ equal to zero; this minimizes the $\mathrm{SSR}$.)
2.$E(\hat{\mathbf{\beta}}|\mathbf{X}) = E((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} (\mathbf{X}\mathbf{\beta}+\mathbf{\epsilon})|\mathbf{X}) = \mathbf{\beta} + ((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime})E(\mathbf{\epsilon}|\mathbf{X}) = \mathbf{\beta}.$
3. $\textrm{Var}(\hat{\mathbf{\beta}}|\mathbf{X}) = E\left[(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}|\mathbf{X}))(\hat{\mathbf{\beta}}-E(\hat{\mathbf{\beta}}|\mathbf{X}))^{\prime}\,\middle|\,\mathbf{X}\right]=\textrm{Var}((\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}|\mathbf{X}) =(\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
Hopefully it helps.
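One way to convince oneself of result 3 is a small Monte Carlo check: keep the design matrix fixed, simulate many response vectors, and compare the empirical covariance of $\hat{\mathbf{\beta}}$ with $\sigma^2 (\mathbf{X}^{\prime}\mathbf{X})^{-1}$. A sketch with arbitrary illustrative values:
set.seed(42)
n <- 50; sigma <- 0.5
X <- cbind(1, runif(n))                  # fixed design matrix
beta <- c(1, 2)                          # true coefficients
betaHats <- replicate(10000, {
  y <- X %*% beta + rnorm(n, sd = sigma) # fresh errors each replication
  c(solve(t(X) %*% X, t(X) %*% y))       # OLS estimate as a plain vector
})
cov(t(betaHats))                         # empirical covariance of betaHat
sigma^2 * solve(t(X) %*% X)              # theoretical sigma^2 (X'X)^{-1}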

-
The derivation of the OLS estimator for the beta vector, $\hat{\boldsymbol \beta} = ({\bf X'X})^{-1}{\bf X'Y}$, is found in any decent regression textbook. In light of that, can you provide a proof that it should be $\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y} - (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{\epsilon}$ instead? – gung - Reinstate Monica Apr 06 '16 at 03:40
-
Your $\hat\beta$ is not even an estimator, because $\epsilon$ is not observable! – whuber Apr 06 '16 at 14:55
-
This can also be viewed in this video: https://www.youtube.com/watch?v=jyBtfhQsf44 – StatsStudent Apr 07 '16 at 23:06