
I'm futzing with getting the standard error (SE) of the model residual standard deviation from a linear regression, and I keep getting narrower errors than I should. I'd like to figure out why.

The basic approach I'm taking: fit a linear model, draw simulated coefficients from a multivariate normal distribution, calculate the RSS for each draw, and use sqrt(RSS/(n-2)) to get the model residual SD. Rinse and repeat 1K times, then take the SD of those residual SDs.

But... I keep finding that I'm off by an order of magnitude at least. Here's an example in R.

First, the model.

library(palmerpenguins)
plot(bill_length_mm ~ flipper_length_mm, data = penguins)

pen <- lm(bill_length_mm ~ flipper_length_mm, data = penguins)

Then, the simulated coefficients.

library(mvtnorm)
n <- 1000  # number of simulated coefficient draws
coefTab <- data.frame(rmvnorm(n, mean = coef(pen), sigma = vcov(pen)))
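
As a quick sanity check, the means of the simulated draws should sit right on top of the fitted coefficients:

colMeans(coefTab)  # should be very close to...
coef(pen)          # ...the fitted coefficients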

Now, on to getting the residual SD. A function!

get_resid_sd <- function(a, b, y, x){
  # residuals from intercept a and slope b
  pred <- a + b*x
  res <- y - pred
  # count non-missing residuals (penguins has some NAs)
  n <- sum(!is.na(res))
  sqrt( sum(res^2, na.rm = TRUE)/(n - 2) )
}
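
Plugging the fitted coefficients into this function should reproduce the residual SD that lm() itself reports, which is a handy check that the function is behaving:

# with the OLS coefficients, this should match summary(pen)$sigma
get_resid_sd(coef(pen)[1], coef(pen)[2],
             penguins$bill_length_mm, penguins$flipper_length_mm)
summary(pen)$sigma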

And let's apply it to our simulated coefficients.

coefTab$sigma <- sapply(1:n, function(i){
  get_resid_sd(coefTab[i,1], coefTab[i,2],
               penguins$bill_length_mm, penguins$flipper_length_mm)
})

Now, the SE

> sd(coefTab$sigma)
[1] 0.01208458
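
For what it's worth, the simulated sigmas all sit at or just above the lm() estimate (the RSS is minimized at the OLS coefficients), so it's only their spread that looks suspiciously tight:

# the simulated sigmas hug the fitted value...
mean(coefTab$sigma)
summary(pen)$sigma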

OK, but... to validate, let's check against rstanarm, since it naturally produces simulations with no extra work and can reproduce lm() results when stan_glm() is run with the optimizing algorithm and flat (NULL) priors.

library(rstanarm)
penStan <- stan_glm(bill_length_mm ~ flipper_length_mm, data = penguins,
                    algorithm = "optimizing", prior = NULL,
                    prior_intercept = NULL, prior_aux = NULL)
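
As a quick check that the two fits really are comparable, the point estimates from stan_glm() should sit essentially on top of the lm() estimates:

coef(penStan)  # should essentially match...
coef(pen)      # ...the lm() coefficients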

Now, the SE of the model residual SD....

> sd(as.matrix(penStan)[,3])
[1] 0.1639105

Huh. You can see why I'm (a) glad I checked myself and (b) worried I did something tragically wrong.

Would love to know folks' thoughts, because if I can fix this, I think it's a killer example for students. I feel like I'm falling down on something obvious due to 2020 brain.


1 Answer


$\text{SE}(\hat{\sigma})$ is not the uncertainty in $\hat{\sigma}$ induced by uncertainty in the parameter estimates, which is what you estimated. It is the uncertainty in $\hat{\sigma}$ coming from the data generating process: if we take repeated samples from the population, how does the estimate $\hat{\sigma}$ vary across them?

We can simulate this by repeatedly creating random data, fitting a model to the data, finding the standard deviation of the residuals, and looking at the distribution of those standard deviations.

# simulate data from a known linear model with error SD = 1
make_some_data <- function(NOBS = 1000){
  beta <- c(-2, 0.4)
  x <- rnorm(n = NOBS, mean = 5, sd = 1)
  y <- beta[1] + beta[2]*x + rnorm(n = NOBS)
  data.frame(x, y)
}

# fit the model and return its residuals
get_model_resids <- function(some_data){
  mod <- lm(y ~ x, data = some_data)
  residuals(mod)
}

# repeat: simulate a fresh dataset, fit, record the residual SD
estimatedsds <- rep(NA, 1000)
for(i in 1:1000){
  estimatedsds[i] <- sd(get_model_resids(make_some_data()))
}
mean(estimatedsds)
# [1] 0.9988542   
sd(estimatedsds)
# [1] 0.02217692

To verify this, we can compare with a stan_glm fit on similar data.

library(rstanarm)
foo <- make_some_data()
stanmod <- stan_glm(y ~ x, data = foo,
                    algorithm = "optimizing", prior = NULL,
                    prior_intercept = NULL, prior_aux = NULL)
sd(as.matrix(stanmod)[,3])
# [1] 0.02255132

We can also compare with a bootstrap estimate of $\text{SE}(\hat{\sigma})$, which is a common way to estimate it.

library(boot)
# resample rows with replacement, refit, and record the residual SD
boot_sds <- function(mydata, myindx){
  mydata <- mydata[myindx,]
  mod <- lm(y ~ x, data = mydata)
  sd(residuals(mod))
}
bootedsds <- boot(foo, boot_sds, R = 10000)
sd(bootedsds$t)
# [1] 0.02239548

You may also be interested in this question, which discusses the exact value of the standard error of $\hat{\sigma}$: What is the standard error of the sample standard deviation?
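
The usual large-sample approximation for a normal model, $\text{SE}(\hat{\sigma}) \approx \hat{\sigma}/\sqrt{2\,\text{df}}$, also gives a quick analytic check, and it should land near the simulated and bootstrapped values above:

# large-sample approximation: SE(sigma_hat) ~= sigma_hat / sqrt(2 * residual df)
mod <- lm(y ~ x, data = foo)
sigma(mod) / sqrt(2 * df.residual(mod))
# with sigma near 1 and n = 1000, this is roughly 1/sqrt(1996), about 0.022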
