
I am working with a dataset of 1000 individuals, 200 of whom are disease-positive. I have run a logistic regression with 25 predictors to identify, overall, which variables are significantly predictive. Straightforward...

However, I also want to identify which variables account for the greatest amount of variability for males vs. females, and to see whether there are differences in which variables pop. I considered modeling gender × predictor interaction terms, but that essentially doubles my number of predictors. I proceeded instead with a forward stepwise logistic regression, and what I noticed was that by the last iteration the model correctly identified a high percentage of the non-disease group (>95%) but was very poor at correctly identifying the disease group. If anything, I would prefer a model that errs toward false positives (for clinical reasons)!
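For concreteness, the interaction approach I considered would look something like this (a minimal Python/statsmodels sketch on toy data; `sex`, `x1`–`x3`, and `disease` are stand-ins for my actual variables, with only three of the 25 predictors shown):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for my data: 'disease' is 0/1, 'sex' is a two-level factor.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["disease"] = rng.binomial(1, 0.2, size=n)

# Main effects plus sex-by-predictor interactions: one extra coefficient
# per predictor, so the model roughly doubles in size.
fit = smf.logit("disease ~ C(sex) * (x1 + x2 + x3)", data=df).fit(disp=0)
print(fit.summary())
```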

So I played around: I took a random sample of 200 from the 800 in the non-disease group (to match the 200 disease cases), re-ran the analyses on those 400 individuals, and found that the final iteration of the forward LR correctly predicted a high percentage of both groups. It therefore seemed that using the whole sample yielded a model biased toward the larger group.
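Continuing with the toy data frame from the sketch above, that subsampling step was essentially:

```python
pos = df[df["disease"] == 1]                      # all cases
neg = df[df["disease"] == 0].sample(n=len(pos),   # matched number of controls
                                    random_state=42)
balanced = pd.concat([pos, neg])                  # roughly 50/50 outcome split
```

(With my actual data, `pos` holds the 200 disease cases and `neg` is the random 200 of the 800 controls.)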

In reading through these pages and other sources, it seems that sub-sampling isn't viewed positively for logistic regression, but I could not find anything about using it within an iterative, stepwise procedure.

So my questions are:

1) Is sub-sampling acceptable for a stepwise LR when the dichotomous outcome is this imbalanced?

2) If not, what other procedure(s) should I consider? (e.g., exact logistic regression?)

Bill Black
  • What's your question? – Jeremy Miles Mar 09 '15 at 15:59
  • Added questions for clarity. My apologies. – Bill Black Mar 09 '15 at 16:07
  • I don't know, but I came up with a lot more questions. (1) 25 variables with 200 events --> 8 EPV. Is this an issue? (2) Are you interested in prediction or in characterizing interaction? (choose one) (3) Are there issues with forward selection? (4) The `So I played around and took a random sample of 200 from the non-disease group` paragraph does not make sense. (5) What do you mean by subsampling? (6) How should one evaluate a logistic regression model? – charles Mar 09 '15 at 16:27
  • " ... the model correctly identified a high percentage of non-disease group (>95%) but was very poor in correctly identifying the disease group. ... " How did you reach this conclusion? By making some hard classification, based on some cutoff? That is the wrong way to go about it: it introduces a non-proper scoring rule! Search this site for "scoring rule". So that's the wrong way to evaluate the model! Tell us how you reached that conclusion. – kjetil b halvorsen Mar 09 '15 at 16:40
  • Charles: 1) Overall 1000 participants, but yes, 200 of those were "events." 2) Both, really: first, determining the significant predictors for the overall sample, and second, characterizing the interaction (male/female). 3) I'm not aware of any issues with forward selection. 4) Sorry - I have 800 people w/o disease and 200 with, so I took a random sample of 200 of the 800 to match the sample size of those WITH the disease. 5) See #4. 6) We are evaluating based on the percentage of correct classification and Nagelkerke's R-squared. – Bill Black Mar 09 '15 at 16:48
  • Mr. Halvorsen: No hard-and-fast rule, but it was disparate: 95% for the non-disease group and 55% for the disease group at the final (9th) iteration. After the first iteration, the non-disease group was correctly classified 85% of the time, and the disease group 75%. So with each iteration the model got better with the non-disease group and worse with the disease group, which is where my primary concern and confusion come from. I will look up "scoring rule" to better familiarize myself with your point. – Bill Black Mar 09 '15 at 16:53
  • About scoring rules, see F. Harrell's answer http://stats.stackexchange.com/questions/86917/is-a-lower-training-accuracy-possible-in-overfitting-one-class-svm/87057#87057 and user777's: http://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044#127044 – kjetil b halvorsen Mar 09 '15 at 21:27
  • I'd recommend taking a look at bootstrap bagging or boosting techniques for identification of risk factors. – StatsStudent Mar 10 '15 at 03:22

1 Answer


The simple answer is no: subsampling will not help, if by subsampling you mean drawing a balanced sample so that the event ratio changes from 200/1000 to 200/400. Balancing of that kind is sometimes used with classification algorithms, but it is generally of no use in maximum-likelihood / probability models: in logistic regression, undersampling one outcome group shifts only the intercept and leaves the slope coefficients essentially unchanged.
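Here is a small simulation sketch (illustrative only, not the question's data) of why: undersampling the controls to a 1:1 ratio leaves the slope estimates essentially unchanged and shifts only the intercept, by about the log of the control-to-case ratio.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=(n, 2))
# True model: logit(p) = -2.2 + 1.0*x1 - 0.5*x2 (events are the minority class)
p = 1 / (1 + np.exp(-(-2.2 + 1.0 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p)

def coefs(XX, yy):
    return sm.Logit(yy, sm.add_constant(XX)).fit(disp=0).params

full = coefs(X, y)

# Undersample controls to a 1:1 case/control ratio, as in the question.
cases = np.flatnonzero(y == 1)
ctrls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
idx = np.concatenate([cases, ctrls])
sub = coefs(X[idx], y[idx])

print("full sample :", np.round(full, 2))  # ~ [-2.2, 1.0, -0.5]
print("balanced sub:", np.round(sub, 2))   # same slopes, shifted intercept
print("predicted intercept shift:",
      round(np.log((y == 0).sum() / (y == 1).sum()), 2))
```

So balancing changes the estimated baseline risk, not which predictors look important.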

What the comments are trying to suggest is that the question reveals several larger issues, each of which could be a textbook chapter by itself:

  • The effective sample size of a logistic model is the number of events, and for model building it is measured as events per variable (EPV). With 200 events and 25 candidate predictors you have 8 EPV (fewer still if some predictors are categorical with multiple levels), so your sample size relative to the number of predictors is small, and that is going to cause problems.
  • Detecting interactions is notoriously difficult.
  • Forward/backward and combined stepwise variable-selection methods have major issues; this is a popular topic on Cross Validated. The problem is probably overemphasized when EPV > 50, but yours is the classic situation in which automated variable selection will mislead you (Austin: "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality"). A toy demonstration of this instability follows the list.
  • Variable selection is hard.
  • Prediction models and descriptive models often require different methods. Not always, and the differences are often overemphasized, but from the limited information available it looks like those two goals will be difficult to combine in this case.
  • When evaluating logistic models, avoid "percentage correctly classified": classification rests on an arbitrary probability cut-off and is not a proper scoring rule. For a fair review, see Steyerberg: "Assessing the performance..." http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184/. A scoring sketch also follows the list.
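To illustrate the instability point from the stepwise bullet, here is a toy sketch (simulated data and a bare-bones AIC-based forward selector, not the procedure from the Austin paper): rerunning forward selection on bootstrap resamples picks noticeably different variable sets, even though only the first three predictors carry any signal.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 1000, 25
X = rng.normal(size=(n, k))
# Only the first three predictors carry signal; the rest are noise.
true_logit = -1.6 + 0.5 * X[:, 0] - 0.4 * X[:, 1] + 0.3 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

def forward_select(X, y, max_vars=8):
    """Greedy forward selection by AIC: a bare-bones stand-in for
    SPSS/SAS-style stepwise logistic regression."""
    chosen, remaining = [], list(range(X.shape[1]))
    best_aic = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).aic
    while remaining and len(chosen) < max_vars:
        aic, j = min(
            (sm.Logit(y, sm.add_constant(X[:, chosen + [j]])).fit(disp=0).aic, j)
            for j in remaining
        )
        if aic >= best_aic:      # no AIC improvement: stop
            break
        best_aic = aic
        chosen.append(j)
        remaining.remove(j)
    return sorted(chosen)

for b in range(5):               # a few bootstrap resamples
    idx = rng.integers(0, n, size=n)
    print(f"resample {b}: selected columns {forward_select(X[idx], y[idx])}")
```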
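And on the evaluation point: score the predicted probabilities themselves instead of hard classifications at a cut-off. A sketch using scikit-learn's metrics on the same kind of simulated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 25))
true_logit = -1.6 + 0.5 * X[:, 0] - 0.4 * X[:, 1] + 0.3 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("Brier score:", brier_score_loss(y_te, p))  # proper scoring rule
print("log loss   :", log_loss(y_te, p))          # proper scoring rule
print("AUC        :", roc_auc_score(y_te, p))     # threshold-free discrimination
```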
charles