
I am attempting to bootstrap a logistic regression in R. The problem is that I get very high SEs. I'm not sure what to do about this or what it means. Does it mean the bootstrap does not work well for my particular data? Here is my code:

get.coeffic = function(data, indices){
  data    = data[indices,]
  mylogit = glm(F~B+D, data=data, family="binomial")
  return(mylogit$coefficients)
}

Call:
boot(data = Pres, statistic = logit.bootstrap, R = 1000)

Bootstrap Statistics :
       original      bias    std. error
t1* -10.8609610 -23.0604501  338.048398
t2*   0.2078474   0.4351766    6.387781

I also want to know how bootstrapping would help with my final regression model. That is, which regression coefficients do I use in my final model?

> fit <- glm(F ~ B + D , data = President, family = "binomial")
> summary(fit)
Call:
glm(formula = F ~ B + D, family = "binomial", data = President)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7699  -0.5073   0.1791   0.8147   1.2836  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -14.57829    8.98809  -1.622   0.1048  
B             0.15034    0.14433   1.042   0.2976  
D             0.13385    0.08052   1.662   0.0965 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23.508  on 16  degrees of freedom
Residual deviance: 14.893  on 14  degrees of freedom
AIC: 20.893

Number of Fisher Scoring iterations: 5
ali.hash
  • How many lines do you have? Could you please also add the output of 'summary' of the original model? Maybe the reason is perfect separation. – Michael M Aug 03 '15 at 20:43
  • Hi Michael, please see edit. Also, could you explain "perfect separation"? Thanks! – ali.hash Aug 03 '15 at 20:51
  • The bootstrap should not be used to tell you which coefficients to use in the final model. The final model should be the model that you pre-specify, if you want to preserve all aspects of statistical inference and minimize bias. – Frank Harrell Aug 03 '15 at 21:19
  • Are `B` & `D` highly correlated? The model as a whole is clearly significant `1-pchisq(23.508-14.893, 2)`, `# [1] 0.01346718`. – gung - Reinstate Monica Aug 03 '15 at 21:22
  • Thanks for the responses. Yes, the correlation between B, D is .49. @Frankc, could you explain? I thought bootstrap would help with stat. inference and minimize bias. – ali.hash Aug 03 '15 at 21:59
  • 1
    Bootstrap can help one to estimate bias and how much a predictive method falls apart. It is not generally used to devise new estimators. It is especially not good for variable selection unless you use a double bootstrap and you also have no collinearity. – Frank Harrell Aug 04 '15 at 17:50

2 Answers


I don't follow your code, you call your data different things in different places, I don't see your function being used anywhere, etc. Setting that aside, I'm not sure there is a big problem with your model other than the fact that you don't have much data (I gather N = 17, which is pretty small). I don't think your standard errors would be that problematic if you had a more typical sample size.

Moreover, your model seems impressively good to me for a logistic regression with so few data to work with. The reason neither variable is individually significant is clearly that they are correlated. This inflates your SEs, but wouldn't be bad if you had more data. As it is, the correlation inflates the variance of your estimates by about a third (so your SEs are roughly 15% larger than they would have been if your predictors were perfectly uncorrelated):

1/(1-.49^2)         # variance inflation factor
# [1] 1.315963
sqrt(1/(1-.49^2))   # corresponding inflation of the SEs
# [1] 1.147154

That means the model doesn't know which of the two variables should be given credit for predicting the response. Nonetheless, there is good predictive ability amongst those variables somewhere, as can be seen by their combined significance:

1-pchisq(23.508-14.893, 2)
# [1] 0.01346718

As far as bootstrapping goes, it is used to get an estimate of the nature of the sampling distribution that doesn't rely on assumptions about normality. It may help you to read this excellent CV thread: Explaining to laypeople why bootstrapping works.
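To make this concrete, here is a minimal self-contained sketch of the intended workflow using the `boot` package. The data are simulated (the original `Pres` data frame isn't shown in the question), the predictors are deliberately correlated to mimic the situation described, and the statistic function is given the name `logit.bootstrap` used in the question's `boot()` call:

```r
library(boot)

# Simulated stand-in for the question's data (note: `F` masks base R's FALSE
# alias here, mirroring the question's variable names)
set.seed(1)
n    <- 200                                   # a more comfortable sample size
B    <- rnorm(n)
D    <- 0.5 * B + rnorm(n)                    # correlated predictors, as in the question
F    <- rbinom(n, 1, plogis(-1 + 0.8 * B + 0.6 * D))
Pres <- data.frame(F = F, B = B, D = D)

# Statistic: refit the model on each resample and return the coefficients
logit.bootstrap <- function(data, indices) {
  d   <- data[indices, ]
  fit <- glm(F ~ B + D, data = d, family = binomial)
  coef(fit)
}

bt <- boot(data = Pres, statistic = logit.bootstrap, R = 1000)
bt                                      # bootstrap SEs of sensible magnitude
boot.ci(bt, type = "perc", index = 2)   # percentile CI for the B coefficient
```

With an adequate sample size, the bootstrap SEs should land close to the model-based SEs from `summary(glm(...))`; it is the n = 17 case that blows them up.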

gung - Reinstate Monica
  • I was also ending up with N = 17 (didn't get an answer to my question above)... (+1) for your answer. I was thinking why the bootstrap SE are so much larger than the original ones. Maybe it is because the number of 0 or 1 is so small that we have almost perfect separation in some bootstrap samples (with huge corresponding SE). Clearly I would not trust a logistic regression with 17 obs and 3 parameters very much. And absolutely mistrust the bootstrap results. I don't know why people think that bootstrap is a solution to small samples. It is not! – Michael M Aug 04 '15 at 11:25
  • Thanks @MichaelM. Bootstrapping may not work, but then, nothing may with insufficient sample sizes. I don't think separation is likely to be the issue here, though. Neither the estimated coefficient nor the SE are that large (they would tend towards infinity) & the number of Fisher Scoring iterations is only 5 (you often have ~20 or more w/ separation). I think this is just somewhat correlated variables w/ very few data. – gung - Reinstate Monica Aug 04 '15 at 13:13
  • 2
    @MichaelM, " I don't know why people think that bootstrap is a solution to small samples". I thought bootstrap would help with small samples. I've read examples where people use to estimate regression coefficients for their models (which is what I am trying to do). Could you explain why that is wrong? Thank you. – ali.hash Aug 04 '15 at 20:52
  • 2
    @ali.hash, to understand the bootstrap, it may help you to read the thread I linked in my answer. The bootstrap takes your data as an estimate of the population. If your dataset is too small, it is impossible for it to do a decent job of reflecting the population. – gung - Reinstate Monica Aug 04 '15 at 22:03
  • @gung, thank you. So lets say I have a larger dataset. Could I use bootstrap to estimate regression coefficients instead of their confidence intervals? – ali.hash Aug 04 '15 at 22:12
  • 1
    Bootstrapping isn't for the point estimate of your coefficients; it is for the SE / confidence intervals. – gung - Reinstate Monica Aug 04 '15 at 22:14

The problem here is the extremely small sample size (n = 17). When you bootstrap, two things can occur:

  1. Some bootstrap samples will contain very few 1s (or very few 0s) for the dependent variable.
  2. In those samples, the logistic estimates will therefore be unstable, resulting in heavily biased estimates.

This might explain why you get huge standard errors. Tip: check the distribution of 1s in some of the bootstrap samples. With N = 17, even the ordinary logistic regression is not guaranteed to converge.
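The tip above can be checked directly. A quick sketch (with placeholder data, since the original data frame isn't shown) that tabulates how many 1s land in each of 1000 resamples of size 17:

```r
# With n = 17, row resampling often produces lopsided outcome splits.
# `Pres$F` stands in for the question's binary response.
set.seed(2)
Pres <- data.frame(F = rbinom(17, 1, 0.6))   # placeholder data, n = 17

ones <- replicate(1000, {
  idx <- sample(nrow(Pres), replace = TRUE)
  sum(Pres$F[idx])                           # count of 1s in this resample
})
table(ones)                    # distribution of 1s across resamples
mean(ones <= 2 | ones >= 15)   # share of badly unbalanced resamples
```

Resamples at the extremes are exactly the ones where the refit model is near separation, and their enormous coefficients drag the bootstrap SE upward.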

subra