
Is it possible to do stepwise (direction = "both") model selection in nested binary logistic regression in R? I would also appreciate it if you could teach me how to get:

  • Hosmer-Lemeshow statistic,
  • Odds ratio of the predictors,
  • Prediction success of the model.

I used the lme4 package in R. This is the script I used to fit the general model with all the independent variables:

nest.reg <- glmer(decision ~ age + education + children + (1|town), family = binomial, data = fish)

where:

  • fish -- dataframe
  • decision -- 1 or 0: whether the respondent exits (1) or stays (0).
  • age, education and children -- independent variables.
  • town -- random effect (where our respondents are nested)

Now my problem is how to get the best model. I know how to do stepwise model selection, but only for linear regression: step(lm(decision ~ age + education + children, data = fish), direction = "both"). But this cannot be used for binary logistic regression, right? Also, when I add (1|town) to the formula to account for the effect of town, I get an error.
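(For reference, a minimal sketch of the same idea for a fixed-effects-only logistic model: step() also accepts glm fits and selects by AIC, but it was written for lm/glm objects and generally does not work with glmer fits, which is likely the source of the error with (1|town).)

# stepwise AIC selection on a plain (non-nested) logistic regression
fixed.reg <- glm(decision ~ age + education + children, family = binomial, data = fish)
step(fixed.reg, direction = "both")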

By the way... I'm very much thankful to Manoel Galdino, who provided me with the script on how to run a nested logistic regression.

Thank you very much for your help.

Richard Muallil
  • Hosmer-Lemeshow is considered obsolete: https://stats.stackexchange.com/questions/273966/logistic-regression-with-poor-goodness-of-fit-hosmer-lemeshow – kjetil b halvorsen May 14 '20 at 12:09

3 Answers


I really appreciate the pointers to my book and papers and R package. Briefly, stepwise regression is invalid as it destroys all statistical properties of the result as well as faring poorly in predictive accuracy. There is no reason to use ROC curves to guide model selection (if model selection is even a good idea), because we have the optimum measure, the log-likelihood and its variants such as AIC. Thresholds for the dependent variable should be dealt with using ordinal regression instead of making a series of binary models. The Hosmer-Lemeshow test is now considered obsolete by many statisticians as well as the original authors. See the reference below (which proposes a better method, implemented in the rms package).

@Article{hos97com,
  author  = {Hosmer, D. W. and Hosmer, T. and {le Cessie}, S. and Lemeshow, S.},
  year    = {1997},
  title   = {A comparison of goodness-of-fit tests for the logistic regression model},
  journal = {Statistics in Medicine},
  volume  = {16},
  pages   = {965--980},
  annote  = {goodness-of-fit for binary logistic model; difficulty with Hosmer-Lemeshow statistic being dependent on how groups are defined; sum of squares test (see cop89unw); cumulative sum test; invalidity of naive test based on deviance; goodness-of-link function; simulation setup; see sta09sim}
}

See also

@Article{sta09sim,
  author  = {Stallard, Nigel},
  title   = {Simple tests for the external validation of mortality prediction scores},
  journal = {Statistics in Medicine},
  year    = {2009},
  volume  = {28},
  pages   = {377--388},
  annote  = {low power of older Hosmer-Lemeshow test; avoiding grouping of predicted risks; logarithmic and quadratic test; scaled $\chi^2$ approximation; simulation setup; best power seems to be for the logarithmic (deviance) statistic and for the chi-square statistic that is like the sum of squared errors statistic except that each observation is weighted by $p(1-p)$}
}
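For concreteness, a minimal sketch of that better test as implemented in rms, under the assumption of a fixed-effects-only fit to the asker's data (rms::lrm does not handle the (1|town) random effect); exp(coef(...)) also gives the odds ratios the asker wanted:

library(rms)
f <- lrm(decision ~ age + education + children, data = fish, x = TRUE, y = TRUE)  # x, y retained for the test
resid(f, "gof")  # le Cessie-van Houwelingen unweighted sum-of-squares goodness-of-fit test
exp(coef(f))     # odds ratios (the intercept term is an odds, not an odds ratio)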

Frank Harrell
  • @Frank Harrell, I'm curious: why not use ROC with hierarchical models? As far as I understand the issue, AIC is not that good with hierarchical models. You are right about looking at the deviance of the model, but if the sample size varies with inclusion/exclusion of variables, then how useful is it to look at the deviance to compare models? I thought ROC curves might be useful, though I'm no expert in ROC curves and their properties... – Manoel Galdino Jun 02 '11 at 17:22
  • Inclusion/exclusion of variables based on statistical tests is very problematic, and it's also a bit worrisome that data availability varies by which variables are in the model (multiple imputation may be in order). The use of ROC curves does not deal with that problem, and introduces a non-optimal objective criterion. Deviance/likelihood is an optimal information measure. ROC lacks power. – Frank Harrell Jun 02 '11 at 17:32
  • Thanks. I'll take a look at the issue of the optimal measure and the lack of power of ROC curves. All in all, model selection is a very hard issue. In Bayesian fitting, there is some discussion about whether or not to use Bayes factors to compare models. Gelman prefers posterior predictive checks, and right now I'm following him. – Manoel Galdino Jun 02 '11 at 17:38
  • Speaking in general, model selection is not always a good idea. Penalization performs better. A quadratic penalty (like a Bayesian normal prior) does not yield parsimony but works well. An L1 penalty (lasso) yields a parsimonious model. These methods are based on maximizing a very well studied objective criterion, the penalized likelihood (see the sketch after these comments). – Frank Harrell Jun 02 '11 at 18:55
  • Hi! Thank you very much for your help. I've come up with my final model using the AIC criterion, although I had to run all possible models from the predictors I used, without considering interactions among variables. My problem now is how to test the specificity (true negative rate) and sensitivity (true positive rate) of my model. I've explored R, including rocplot in the Zelig package, but haven't come up with the results that I wanted. Can you please help me out? In Systat I can make a table (for prediction success), but Systat can't do nested binary logistic regression. Thanks a lot. I'm actually new to R. – Richard Muallil Jun 07 '11 at 11:24
  • Here's one of my final models: Catchmodel – Richard Muallil Jun 07 '11 at 11:34
  • I think you have engaged in data torture, and the inference from the "final" model will be badly distorted. And sensitivity, specificity, and ROC will just get in the way of understanding the regression. – Frank Harrell Jun 08 '11 at 22:58
  • Really? I'm so dead! =) Honestly, I'm not a statistician; I just based my method on some published articles. I'm just trying to comply with the comments of the reviewers of our paper (which are to include a goodness-of-fit and a prediction test). I originally ran our stats in Systat (stepwise model selection), which can also compute prediction success (true pos, false pos, true neg and false neg) and the goodness-of-fit test. We reanalyzed our data using R to account for random effects (towns where our respondents are nested). But I can't do it since I'm new to R. Thanks a lot. – Richard Muallil Jun 09 '11 at 03:11
  • Most non-statisticians, and 40% of statisticians, do not realize the severe damage that variable selection inflicts upon the final analysis. A partial cause is "phantom degrees of freedom". I cover this in my book Regression Modeling Strategies. Whenever possible, a good strategy is to formulate a model based on subject matter knowledge and guided by data availability, fit that model, assess its fit, use the bootstrap to validate its accuracy and the absence of significant overfitting, then use that one model for inference. Root cause: var. selection was invented before simulation. – Frank Harrell Jun 09 '11 at 12:45
  • Hi Frank. In your RMS book you mention the paper by Sauerbrei (92) regarding model selection, with the comment that the threshold used to select variables was arbitrary. I wonder if you had seen this paper, and what your thoughts were? "On identifying significant edges in graphical models: Scutari 13" – user20650 Oct 24 '16 at 15:03
  • No, sorry I haven't seen that. – Frank Harrell Oct 25 '16 at 03:17
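To illustrate the penalization comment in the thread above, a hypothetical sketch using the glmnet package (glmnet is an editorial substitution, not something named in this discussion); alpha = 1 gives the L1 (lasso) penalty and alpha = 0 the quadratic (ridge) penalty:

library(glmnet)
X <- model.matrix(~ age + education + children, data = fish)[, -1]    # predictor matrix, intercept dropped
cvfit <- cv.glmnet(X, fish$decision, family = "binomial", alpha = 1)  # lasso-penalized logistic model
coef(cvfit, s = "lambda.min")  # coefficients at the cross-validated penalty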

Disclaimer: This is merely a comment but it won't fit as such, so I'll leave it as a CW response.

Everything is already available in Frank Harrell's rms package (which model to choose, how to evaluate its predictive performance, how to validate it, how not to fall into the traps of overfitting and stepwise selection, etc.), with formal discussion in his textbook, Regression Modeling Strategies (Springer, 2001), and a nice set of handouts on his website.
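As a minimal sketch of that workflow (again assuming a fixed-effects-only model, since rms does not fit the (1|town) term):

library(rms)
f <- lrm(decision ~ age + education + children, data = fish, x = TRUE, y = TRUE)
validate(f, B = 200)          # bootstrap-validated discrimination indexes, with optimism estimates
plot(calibrate(f, B = 200))   # optimism-corrected calibration curve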

Also, I would recommend the following papers if you're interested in predictive modeling:

  1. Aliferis, C.F., Statnikov, A., Tsamardinos, I., Schildcrout, J.S., Shepherd, B.E., and Harrell, F.E. Jr (2009). Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data. PLoS ONE 4(3): e4922.
  2. Harrell, F.E. Jr, Margolis, P.A., Gove, S., Mason, K.E., Mulholland, E.K., Lehmann, D., Muhe, L., Gatchalian, S., and Eichenwald, H.F. (1998). Development of a clinical prediction model for an ordinal outcome: the World Health Organization Multicentre Study of Clinical Signs and Etiological agents of Pneumonia, Sepsis and Meningitis in Young Infants. WHO/ARI Young Infant Multicentre Study Group. Statistics in Medicine, 17(8): 909-44.
  3. Harrell, F.E. Jr, Lee, K.L., and Mark, D.B. (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4): 361-87.
chl

I'm here again =)!

It happens that right now I'm fitting a nested logistic regression and I have to choose the better model as well. Actually, I don't know how to do stepwise selection with lme4; nonetheless, I'm not sure it is advisable to use AIC to choose the best model (better fit). Take a look at this link; it discusses very briefly when AIC may be appropriate with hierarchical (nested) models. Just to provide some background on the link, it is a FAQ about DIC, which is like AIC and is an output of BUGS.

All in all, I don't know how to answer your question, but I can provide some guidance on how I choose my own models. I look at two things: ROC curves and fake-data simulation. ROC curves are a good graphical way to compare the fit of models in terms of true positive and false positive rates. Depending on what is more sensitive in your case, you can choose a model with a higher true positive rate or a higher true negative rate.

If you use the Zelig package, I think you can fit models with lme4 and plot ROC curves. The documentation has some nice examples of this.
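If Zelig gives you trouble, here is a minimal sketch with the pROC package instead (pROC is an editorial substitution, not part of the original answer), using fitted probabilities from the glmer model in the question:

library(pROC)
phat <- predict(nest.reg, type = "response")  # fitted probabilities from the glmer fit
roc.obj <- roc(fish$decision, phat)
plot(roc.obj)                # the ROC curve
auc(roc.obj)                 # area under the curve
coords(roc.obj, "best")      # a cutoff with its specificity and sensitivity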

Another way (my preferred one) is to simulate data with the parameters estimated from the model. Then you check whether the simulated data resemble the real data. This is a visual comparison, but it can be very effective. If you need a reference on this procedure, Gelman and Hill's book is a standard textbook on this approach. For instance, assume you fitted the model:

P(y = 1) = invlogit(10 + 2*x). Then you create a new variable, y.fake, drawn from the fitted probabilities (note the inverse logit, which keeps the probabilities between 0 and 1):

y.fake <- rbinom(n, 1, plogis(10 + 2*x))  # plogis() is the inverse logit in R

Then you check whether y.fake resembles the true y. You can compute, for instance, the number of cases where y.fake is equal to the true y. You can run more than one simulation to take sampling variability into account, and you can account for the variability of the estimated coefficients as well.
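A minimal sketch of this check, assuming the glmer fit from the question (simulate() draws new response vectors from the fitted model; it reflects sampling variability at the estimated coefficients, so coefficient uncertainty would need an extra step such as bootstrapping):

set.seed(1)
y.fake <- simulate(nest.reg, nsim = 100)  # 100 fake response vectors from the fitted model
match.rate <- sapply(y.fake, function(y) mean(y == fish$decision))  # agreement with the real data
hist(match.rate)  # is the real response typical of what the model generates?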

Hope this helps.

Manoel Galdino
  • Thanks a lot. You really are a great help! =) One quick question first: when choosing a model using ROC, do I still need to run ALL possible models? I have 8 independent variables, which (if I computed it right) gives a total of 255 models (with at least one independent variable). And I am actually analyzing 6 thresholds for the dependent variable (response; 0 = stay and 1 = exit). This means I will be dealing with 1530 total models. I wonder if R provides some model selection technique (like stepwise for linear models) that can simplify the work. Thanks a lot. – Richard Muallil Jun 02 '11 at 10:44
  • @Frank is right about ordinal logistic regression. About model selection, I don't think you should test all 255 models. If you want to try all combinations, why stop there? Why not try transformations of variables (log, square root, etc.)? Why not try interaction terms? And quadratic and cubic terms? Do you see that in the limit you will end up with more models than degrees of freedom? Try to use theory to assess which variables to include, and look at descriptive statistics to assess visually whether or not it makes sense to include a variable. – Manoel Galdino Jun 02 '11 at 17:26