R lm covariance matrix manual calculation: failure

Question

I want to get the variance-covariance matrix manually, then take it's diagonal elements to derive coefficient standard errors, according to:

sqrt(diag(vcov(lm_obj)))

My code:

dummy_data <- data.frame(
     x1 = rnorm(100, 0, 1),
     x2 = rnorm(100, 0, 1),
     y = round(runif(100, 0, 1), 0)
)


dummy_glm <- lm(y ~ x1 + x2,
                 data = dummy_data
                )

summary(dummy_glm)

B <- as.matrix(dummy_glm$coefficients, ncol = 1)

X <- as.matrix(cbind(rep(1, times = 100), dummy_data[, c('x1', 'x2')]), ncol = 3)

## manually do regression calc

XtX <- t(X) %*% X

det(XtX) # > 0 hence there is inverse

XtX_inv <- solve(XtX)

XtX_inv_diag <- diag(XtX_inv)

# standard errors for each coefficient

st_err_B <- sqrt(XtX_inv_diag)

st_err_B_alternative <- sqrt(diag(vcov(dummy_glm)))

vcov(dummy_glm) == XtX_inv

I found my calculation of XtX_inv does not match vcov(dummy_glm). Could you please hint to a mistake?

Hint: Don't you think the standard errors of the estimates ought to have *something* to do with the response vector `y`? — whuber, Aug 25 '17 at 16:50
Sounds reasonable. I think I forgot to add the residuals: XtX_inv — Alexey Burnakov, Aug 25 '17 at 17:01
That's right. You might find it quickest to consult the code rather than look up the formula. `vcov.lm` refers to `summary.lm` which shows the role of `rdf` ("residual degrees of freedom") in the estimate `R` makes. — whuber, Aug 25 '17 at 17:02
Very good! That's basically right, but if you are ever inclined to use such a formula, you will need to take more care in counting the variables. Two of the traps awaiting you are (1) dealing with collinear variables--many of which `R` will automatically detect and remove--and (2) counting categorical variables. That's why the variable count is originally computed by the fitting procedure (such as `lm`) and passed along to other procedures like `summary.lm` and `vcov.lm`. You will notice, too, that `solve` is not invoked: for numeric accuracy, `R` uses a QR decomposition performed by `lm`. — whuber, Aug 25 '17 at 17:18
I never knew that. Thank you for this insight. I have taken a dive in this math not just for student home work. I actually wanted to show my Python-colleague how to compute the covariance matrix, since they do not have any of vcov.lm or lm, or summary.lm. It appears I learned myself a lot. — Alexey Burnakov, Aug 25 '17 at 17:22

score 3 · Accepted Answer · answered Aug 25 '17 at 17:47

Solution for linear regression (lm):

XtX_inv <- solve(XtX) * sum(dummy_glm$residuals ^ 2) / (nrow(dummy_data) - ncol(dummy_data) + 1 - 1)

Solution for logistic regression (glm(...family = binomial(link = 'logit')):

pi <- dummy_glm$fitted.values

w <- pi * (1 - pi)

v <- diag(w, length(w), length(w))

XtX_inv <- solve(t(X) %*% v %*% X)

These seem to match almost perfectly with the results by vcov.

R lm covariance matrix manual calculation: failure

1 Answers1

Linked