0

Background

Let me first say that I read this post and that I looked at the BMS vignette.

I used the package sure (CRAN, R journal), to calculate surrogate residuals of a polr with resids (see data below).* The problem is that these surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids (they don't converge from bootstrapping the residuals). If you run my example code, you will see that the result will be different every time you rerun it. I thought I could temporarily circumvent this issue, by using some form of averaging (for example Bayesian Model Averaging), until I find the solution for my overarching issue.

The goal would be to provide the reader a range for the variable of interest (because my regression output changes every time, it would be kind of weird to just report one output).

Question

So what I want to do is simple. I want to more or less run the regression 10 times (or one hundred times), and combine or average the outcomes. I thought I could use Bayesian Model Averaging for this. However, from the example they give I have not really been able to figure out what to do.

Essentially I just want to run exactly the same regression 10-100 times and take the average of all the values.

To repeat: The point is that, because the residual is different every time I run the glm, the outcome of the coefficient of interest is different. For that reason I would like to report and average of running this regression 10-100 times

I am aware (now) that this is perhaps a bit of a shoe horn approach. Nevertheless, until I find another solution, this the best way I can think of to report my output. I will obviously mention that this is not the most sound way to do it (I have already made a new post describing my overarching issue)

DATA

library(sure) # for residual function and sample data sets
library(MASS) # for polr function

df1 <- df1
df1$z1 <- df1$x
df1$x <- NULL
df1$x <- df2$y
df1$z2 <- df2$x
df1$y <- df3$x/10

mod1 <- polr(as.ordered(x) ~ z1 + z2, data=df1, method='probit')
df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2 <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2)

.*The package is based on this paper, in the Journal of the American Statistical Association.

EDIT (an attempt at averaging three models):

I tried to do the following:

library(MuMIn) # https://cran.r-project.org/web/packages/MuMIn/MuMIn.pdf

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2a <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2a)

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2b <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2b)

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2c <- glm(y ~ x + residuals + z1 + z2, family="quasibinomial", data=df1)
summary(mod2c)

modellist <- list(mod2a, mod2b, mod2c)

summary(model.avg(modellist))

Is that the correct way to do it? It worked when I used lm instead of glm but now all output is NA.

Model-averaged coefficients: (87 not defined because of singularities in all component models)

With lm instead of glm

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2a <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2a)

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2b <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2b)

df1$residuals <- resids(mod1, nsim=100, method="latent")
mod2c <- lm(y ~ x + residuals + z1 + z2, data=df1)
summary(mod2c)

modellist <- list(mod2a, mod2b, mod2c)

summary(model.avg(modellist))


Call:
model.avg(object = modellist)

Component model call: 
lm(formula = y ~ x + residuals + z1 + z2, data = df1)

Component models: 
      df logLik     AICc delta weight
1234b  8 651.69 -1287.30  0.00   0.86
1234a  8 649.68 -1283.29  4.01   0.12
1234c  8 648.08 -1280.09  7.21   0.02

Term codes: 
residuals         x        z1        z2 
        1         2         3         4 

Model-averaged coefficients:  
(full average) 
              Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  0.3939836  0.0188525   0.0188639  20.886  < 2e-16 ***
x.L         -0.0944743  0.0386522   0.0386739   2.443  0.01457 *  
x.Q         -0.0007507  0.0180071   0.0180181   0.042  0.96677    
x.C          0.0252203  0.0140922   0.0141008   1.789  0.07368 .  
x^4         -0.0008289  0.0093890   0.0093947   0.088  0.92970    
residuals    0.0240304  0.0089833   0.0089882   2.674  0.00751 ** 
z1.L         0.0000000  0.0000000   0.0000000      NA       NA    
z1.Q         0.0000000  0.0000000   0.0000000      NA       NA    
z1.C         0.0000000  0.0000000   0.0000000      NA       NA    
z1^4         0.0000000  0.0000000   0.0000000      NA       NA    
z2           0.0026400  0.0032725   0.0032745   0.806  0.42012    
 
(conditional average) 
              Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  0.3939836  0.0188525   0.0188639  20.886  < 2e-16 ***
x.L         -0.0944743  0.0386522   0.0386739   2.443  0.01457 *  
x.Q         -0.0007507  0.0180071   0.0180181   0.042  0.96677    
x.C          0.0252203  0.0140922   0.0141008   1.789  0.07368 .  
x^4         -0.0008289  0.0093890   0.0093947   0.088  0.92970    
residuals    0.0240304  0.0089833   0.0089882   2.674  0.00751 ** 
z1.L                NA         NA          NA      NA       NA    
z1.Q                NA         NA          NA      NA       NA    
z1.C                NA         NA          NA      NA       NA    
z1^4                NA         NA          NA      NA       NA    
z2           0.0026400  0.0032725   0.0032745   0.806  0.42012    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Tom
  • 209
  • 4
  • 17
  • you don't have to use Bayesian model averaging for this, since you are running the same regression, you can just average the models with equal weights – rep_ho Apr 15 '21 at 12:01
  • @rep_ho Thank you very much for your comment. Do you know of any package that does this? – Tom Apr 15 '21 at 12:05
  • This looks a little bit like a request for code help, which people may consider off topic. Can you clarify the statistical question / *that* this is a statistical question? – gung - Reinstate Monica Apr 15 '21 at 13:18
  • Can you clarify why you're doing this? Why do you need to average the residuals? What is the larger goal here? – gung - Reinstate Monica Apr 15 '21 at 13:19
  • @gung-ReinstateMonica I get your point. It started out as a purely statistical question without any code. I just simply want to know how to solve my problem of averaging a coupl of glm's. Then I started reporting what I had been trying.. – Tom Apr 15 '21 at 13:20
  • @gung-ReinstateMonica Isn't it clear from the first two paragraphs? I have an unstable residual in the model because of which the outcome of the model changes all the time. Because I cannot just report a random outcome, I want to average the outcomes, and report that. – Tom Apr 15 '21 at 13:22
  • 1
    lol. I understand your frustration, @Tom. You try to make it better 1 way, then someone comes along & complains about that. "You can please some of the people..." – gung - Reinstate Monica Apr 15 '21 at 13:24
  • We don't actually have your data. I don't see why you would get different outcomes if you run `polr` on the same dataset multiple times. I do. – gung - Reinstate Monica Apr 15 '21 at 13:26
  • @gung-ReinstateMonica It's because the residual calculated by the `sure` package produces a different residual each time. This part: "these surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids" is straight from the explanation of the package. – Tom Apr 15 '21 at 13:28
  • @gung-ReinstateMonica You can actually try to run my example code, it will give you different outputs.. – Tom Apr 15 '21 at 13:29
  • This is an [XY Problem](http://xyproblem.info/). What is the larger point, here? Reading your code more carefully, I'm guessing you want to assess something like a mediation model, where the mediator is ordinal. Is that the ultimate goal? – gung - Reinstate Monica Apr 15 '21 at 13:31
  • It sounds to me like the problem you're trying to solve is ``, and you're wondering if `` is a good way to go about it. Is that fair? Because if your real question is "``?" then I would suggest only asking about that. As its written right now, the question appears to be an [XY Problem](http://xyproblem.info/). – gung - Reinstate Monica Apr 15 '21 at 13:32
  • I am trying to apply a Control Function (CF)/Two Stage Residual Inclusion (2SRI) to a function which has non-linear dependent variable and an ordinal endogenous variable. – Tom Apr 15 '21 at 13:33
  • I would suggest you ask about that, rather than a broken attempt to shoehorn it. – gung - Reinstate Monica Apr 15 '21 at 13:35
  • Hahaha, I get your point now. I will make a new question. – Tom Apr 15 '21 at 13:37
  • I am just a little bit scared that I won't get a proper or any answer to that. – Tom Apr 15 '21 at 13:38
  • 1
    @gung-ReinstateMonica I made a new question (https://stats.stackexchange.com/questions/519852/how-can-i-overcome-non-converging-residuals-when-using-a-control-function-cf). Is this question better? Also: Any suggestion for a better title? – Tom Apr 15 '21 at 14:10

0 Answers0