Background
Let me first say that I read this post and that I looked at the BMS
vignette.
I used the package sure
(CRAN, R journal), to calculate surrogate residuals of a polr
with resids
(see data below).* The problem is that these surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids (they don't converge from bootstrapping the residuals). If you run my example code, you will see that the result will be different every time you rerun it. I thought I could temporarily circumvent this issue, by using some form of averaging (for example Bayesian Model Averaging), until I find the solution for my overarching issue.
The goal would be to provide the reader a range for the variable of interest (because my regression output changes every time, it would be kind of weird to just report one output).
Question
So what I want to do is simple. I want to more or less run the regression 10 times (or one hundred times), and combine or average the outcomes. I thought I could use Bayesian Model Averaging for this. However, from the example they give I have not really been able to figure out what to do.
Essentially I just want to run exactly the same regression 10-100 times and take the average of all the values.
To repeat: The point is that, because the residual is different every time I run the glm, the outcome of the coefficient of interest is different. For that reason I would like to report and average of running this regression 10-100 times
I am aware (now) that this is perhaps a bit of a shoe horn approach. Nevertheless, until I find another solution, this the best way I can think of to report my output. I will obviously mention that this is not the most sound way to do it (I have already made a new post describing my overarching issue)
DATA
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
df1 <- df1
df1$z1 <- df1$x
df1$x <- NULL
df1$x <- df2$y
df1$z2 <- df2$x
df1$y <- df3$x/10
mod1 <- polr(as.ordered(x) ~ z1 + z2, data=df1, method='probit')
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2 <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2)
.*The package is based on this paper, in the Journal of the American Statistical Association.
EDIT (an attempt at averaging three models):
I tried to do the following:
library(MuMIn) # https://cran.r-project.org/web/packages/MuMIn/MuMIn.pdf
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2a <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2a)
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2b <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2b)
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2c <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2c)
modellist <- list(mod2a, mod2b, mod2c)
summary(model.avg(modellist))
Is that the correct way to do it? It worked when I used lm
instead of glm
but now all output is NA
.
Model-averaged coefficients: (87 not defined because of singularities in all component models)
With lm instead of glm
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2a <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2a)
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2b <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2b)
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2c <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2c)
modellist <- list(mod2a, mod2b, mod2c)
summary(model.avg(modellist))
Call:
model.avg(object = modellist)
Component model call:
lm(formula = y ~ x + residuals + z1 + z2, data = df1)
Component models:
df logLik AICc delta weight
1234b 8 651.69 -1287.30 0.00 0.86
1234a 8 649.68 -1283.29 4.01 0.12
1234c 8 648.08 -1280.09 7.21 0.02
Term codes:
residuals x z1 z2
1 2 3 4
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) 0.3939836 0.0188525 0.0188639 20.886 < 2e-16 ***
x.L -0.0944743 0.0386522 0.0386739 2.443 0.01457 *
x.Q -0.0007507 0.0180071 0.0180181 0.042 0.96677
x.C 0.0252203 0.0140922 0.0141008 1.789 0.07368 .
x^4 -0.0008289 0.0093890 0.0093947 0.088 0.92970
residuals 0.0240304 0.0089833 0.0089882 2.674 0.00751 **
z1.L 0.0000000 0.0000000 0.0000000 NA NA
z1.Q 0.0000000 0.0000000 0.0000000 NA NA
z1.C 0.0000000 0.0000000 0.0000000 NA NA
z1^4 0.0000000 0.0000000 0.0000000 NA NA
z2 0.0026400 0.0032725 0.0032745 0.806 0.42012
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) 0.3939836 0.0188525 0.0188639 20.886 < 2e-16 ***
x.L -0.0944743 0.0386522 0.0386739 2.443 0.01457 *
x.Q -0.0007507 0.0180071 0.0180181 0.042 0.96677
x.C 0.0252203 0.0140922 0.0141008 1.789 0.07368 .
x^4 -0.0008289 0.0093890 0.0093947 0.088 0.92970
residuals 0.0240304 0.0089833 0.0089882 2.674 0.00751 **
z1.L NA NA NA NA NA
z1.Q NA NA NA NA NA
z1.C NA NA NA NA NA
z1^4 NA NA NA NA NA
z2 0.0026400 0.0032725 0.0032745 0.806 0.42012
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1