
I am running a stepwise binary logit regression in Stata with 14 independent variables. Two of the independent variables are dummies (taking a value of 0 or 1). I have tested the independent variables for multicollinearity and transformed them, by standardizing or taking the natural logarithm of their values, in order to mitigate this issue (VIF < 2.5). The normal model runs smoothly; however, when I bootstrap the sample (73 observations) with 1,000 replications, I receive p-values of 1.0000. Furthermore, the results conclude with the note: "one or more parameters could not be estimated in 314 bootstrap replicates; standard-error estimates include only complete replications."
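For reference, the setup described above presumably looks something like the following in Stata, where `y` and `x1`–`x14` are placeholder names for the outcome and the 14 predictors:

```stata
* Hypothetical variable names: y is the binary outcome, x1-x14 the predictors.
logit y x1-x14                            // the "normal" model

* The bootstrapped version with 1,000 replications:
bootstrap, reps(1000): logit y x1-x14
```

The note about incomplete replications typically means that in those 314 resamples `logit` either failed to converge or had to drop a parameter, e.g. because a dummy perfectly predicted the outcome in that resample; this is common with only 73 observations and 14 predictors.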

Two questions: 1. Is the VIF threshold that I used correct (VIF < 2.5)? What other ways are there to deal with multicollinearity without dropping one of the variables? 2. Since I assume that multicollinearity is no longer an issue, what else could I have done wrong in my bootstrapping methodology?
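(As an aside on question 1: VIF is a property of the predictors alone, so in Stata it is typically checked via a linear regression, since `estat vif` is only available after `regress`, not after `logit`. A minimal sketch with the same placeholder names:)

```stata
* VIF involves only the X's, so a linear fit suffices for the diagnostic,
* even though the substantive model is a logit.
regress y x1-x14
estat vif
```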

Many thanks in advance for your answer(s)!

Best! Tim

  • Your approach is not honest about the number of parameters estimated. The transformation-estimation process needs to be part of the bootstrap, as does every other modeling step that utilized $Y$. Collinearity, on the other hand, can often ignore $Y$ and can be dealt with before outcome modeling. There is no need to compute $P$-values using the bootstrap, as you already have those from the original model fit. – Frank Harrell May 13 '14 at 12:43
  • Frank, thanks a lot for your quick reply. To put it in layperson's terms: does this mean that I do not need to bootstrap my sample? Isn't the initial sample size of 73 too small to obtain reliable results? Furthermore, what do you mean by "not honest"? That the transformations I chose are not consistent with each other? Unfortunately, the issue of multicollinearity appears when I use a consistent approach. – Tim May 13 '14 at 13:04
  • You are effectively estimating several more parameters when you try different transformations. You need to let the bootstrap repeat *from scratch* all the modeling steps each time, including examining transformations; see the sketch after this comment thread. [This is why just fitting regression splines is often a great approach: the bootstrap just refits the regression splines for each resample.] – Frank Harrell May 13 '14 at 16:26
  • Concerning your question about $n=73$, I wouldn't expect the bootstrap to improve on the accuracy. – Frank Harrell May 13 '14 at 17:11
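To make the "from scratch" point concrete, here is a minimal sketch, under the same placeholder names `y` and `x1`–`x14` (the 0.10 removal threshold is illustrative), of a Stata program that redoes the backward selection inside every bootstrap replicate and tallies how often each predictor survives:

```stata
capture program drop swboot
program define swboot, rclass
    // Redo the entire backward selection on the resampled data.
    stepwise, pr(0.10): logit y x1-x14
    // Record whether each candidate predictor survived this replicate.
    local kept : colnames e(b)
    forvalues i = 1/14 {
        return scalar in`i' = strpos(" `kept' ", " x`i' ") > 0
    }
end

* Build the expression list in1=r(in1) ... in14=r(in14), then bootstrap.
local explist
forvalues i = 1/14 {
    local explist `explist' in`i'=r(in`i')
}
bootstrap `explist', reps(1000): swboot
```

The resulting inclusion fractions show how unstable the selected model is across resamples; any data-driven transformation choices would likewise have to be scripted inside `swboot` for the bootstrap to reflect them honestly.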

1 Answer


Consider not doing stepwise regression, which is a good way to almost ensure biased results:

Malek, M. H., Berger, D. E., and Coburn, J. W. (2007). On the inappropriateness of stepwise regression analysis for model building and testing. European Journal of Applied Physiology, 101(2):263–264.

Steyerberg, E. W., Eijkemans, M. J., and Habbema, J. D. F. (1999). Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology, 52(10):935–942.

Whittingham, M., Stephens, P., Bradbury, R., and Freckleton, R. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75(5):1182–1189.

  • Alexis, thanks for your answer. However, since I am applying backward stepwise regression, all of the independent variables are already included in my first model, which yields the described results for the bootstrapped sample. – Tim May 13 '14 at 13:02
  • Which does not get you around the fact that you are, among other things, selecting for the most heteroscedastic predictors. – Alexis May 13 '14 at 17:02