I am working with a dataset of 1000 individuals, 200 of whom are disease-positive. I have run a logistic regression with 25 predictors to identify which variables are significantly predictive overall. Straightforward...
However, I also want to identify which variables account for the most variability in males vs. females, and see whether different variables stand out in each group. I considered modeling gender-by-predictor interaction terms, but that essentially doubles my number of predictors. I proceeded with a forward stepwise logistic regression, and what I noticed was that by the last iteration the model correctly classified a high percentage of the non-disease group (>95%) but was very poor at correctly classifying the disease group. If anything, I would prefer a model that errs toward false positives (for clinical reasons)!
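To make the classification percentages above concrete, here is a minimal sketch of how the two per-group rates are computed from predicted probabilities. The probabilities are made up for illustration (they are not from my data), the function name is hypothetical, and the 0.5 cutoff is an assumption about the software default rather than something I have verified. The second call uses a cutoff equal to the 20% disease prevalence, purely to show how the two rates move when the cutoff changes.

```python
def classification_rates(y_true, p_hat, cutoff=0.5):
    """Return (sensitivity, specificity) when p_hat >= cutoff is called 'disease'."""
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative data only: 4 disease-positive and 16 disease-negative (20% positive).
y_true = [1] * 4 + [0] * 16
p_hat = [0.70, 0.45, 0.35, 0.15,              # made-up probabilities, positives
         0.55, 0.40, 0.30, 0.25, 0.22, 0.19,  # made-up probabilities, negatives
         0.18, 0.15, 0.12, 0.10, 0.10, 0.08,
         0.08, 0.05, 0.05, 0.03]

sens_50, spec_50 = classification_rates(y_true, p_hat, cutoff=0.5)
sens_20, spec_20 = classification_rates(y_true, p_hat, cutoff=0.2)

print(f"cutoff 0.5: sensitivity={sens_50:.4f}, specificity={spec_50:.4f}")
print(f"cutoff 0.2: sensitivity={sens_20:.4f}, specificity={spec_20:.4f}")
```

With these toy numbers the 0.5 cutoff gives high specificity and low sensitivity, the same pattern I saw in the full sample, while the lower cutoff trades specificity for sensitivity.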
So I experimented: I took a random sample of 200 from the non-disease group, reran the analysis on those 200 plus the 200 disease-positive individuals, and found that the final iteration of the forward LR correctly classified a high percentage of both groups. It therefore seemed that using the whole sample yielded a model biased toward the larger group.
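For reference, the sub-sampling step described above can be sketched as follows. The row indices are placeholders standing in for the real records, the seed is arbitrary, and the variable names are hypothetical. The last lines just compute the baseline log-odds of disease before and after sub-sampling, which move from 1:4 to 1:1.

```python
import math
import random

random.seed(0)  # arbitrary seed so the sketch is reproducible

# Placeholder outcome vector: 200 disease-positive, 800 disease-negative.
n_total, n_cases = 1000, 200
labels = [1] * n_cases + [0] * (n_total - n_cases)

case_idx = [i for i, y in enumerate(labels) if y == 1]
control_idx = [i for i, y in enumerate(labels) if y == 0]

# Draw as many non-disease rows as there are disease rows, without replacement.
sampled_controls = random.sample(control_idx, k=len(case_idx))
balanced_idx = sorted(case_idx + sampled_controls)

# Baseline log-odds of disease in the full data vs. the balanced subsample.
full_log_odds = math.log(len(case_idx) / len(control_idx))
balanced_log_odds = math.log(len(case_idx) / len(sampled_controls))

print(len(balanced_idx))             # number of rows kept for the refit
print(round(full_log_odds, 3))       # log(200/800)
print(round(balanced_log_odds, 3))   # log(200/200)
```

The model would then be refit on the rows in `balanced_idx` only; the 600 unsampled non-disease rows are discarded, which is part of what worries me about this approach.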
In reading through these pages and other sources, it seems that sub-sampling isn't viewed positively regarding LR, but I could not find anything about using it in an iterative, stepwise procedure.
So my questions are:
1) Is sub-sampling acceptable for a stepwise LR when the dichotomous outcome is this imbalanced (200 vs. 800)?
2) If not, what other procedure(s) should I consider? (e.g., exact logistic regression?)