
I have data that have been collected using case-control procedures, in which the population of positive cases is collected with a random sample of negative cases. This yields 62 positive cases and 179 controls. There are 58 possible predictor variables (mostly numeric, two factors).

My goal is not classification per se: I am in the social sciences, and significance testing is a big deal in my discipline, for better or worse. My collaborators want to make inferences about which predictors increase the probability of being a positive case. Obviously, the data are unbalanced and non-randomly sampled, and I have low power. I am trying to select variables for a logistic regression.

As an exploratory option, I have run a random forest with various specifications to try to find important variables. My false positive rate is, as expected, only about .03, as there are many more negative cases than positive. My false negative rate is quite high, about .70. Adjusting the class weights doesn't seem to help with this.
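To make those rates concrete, here is a rough back-of-the-envelope sketch of the counts they imply. It assumes both rates are computed over the full sample of 62 positives and 179 controls (for example, from out-of-bag predictions); the variable names are illustrative, and the counts are rounded, so they are approximate.

```python
# Approximate confusion-matrix counts implied by the reported error rates.
# Assumption: the rates apply to all 62 positives and all 179 controls.
n_pos, n_neg = 62, 179
fnr, fpr = 0.70, 0.03  # reported false negative / false positive rates

false_neg = round(fnr * n_pos)  # positives predicted as controls
true_pos = n_pos - false_neg
false_pos = round(fpr * n_neg)  # controls predicted as positives
true_neg = n_neg - false_pos

accuracy = (true_pos + true_neg) / (n_pos + n_neg)
majority_baseline = n_neg / (n_pos + n_neg)  # always predicting "control"

print(false_neg, true_pos, false_pos, true_neg)
print(round(accuracy, 3), round(majority_baseline, 3))
```

Under these assumptions, roughly 43 of the 62 positive cases are missed, and the overall accuracy (about 0.80) is only modestly above the 0.74 obtained by always predicting "control".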

What would be a good variable selection procedure, such that the reduced set of variables could then be used to estimate rare-events logistic regressions?

kjetil b halvorsen
Trey
  • You may give RF one more chance with [Boruta](http://cran.r-project.org/web/packages/Boruta/index.html), yet be aware that the chance of overfitting any method is huge -- some validation is a must. –  Jul 07 '12 at 11:45
  • @mbq Thanks. I'll give it a shot. I also had some luck in reducing the bootstrap sample size in the `randomForest()` function, since the `classwt` option seemed to have no interpretable effect. – Trey Jul 07 '12 at 18:47
  • Possible duplicate of [Does an unbalanced sample matter when doing logistic regression?](http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression) – kjetil b halvorsen Nov 03 '16 at 08:21
  • I like to use the R package 'Boruta', which evaluates permutation importance. It could be of use here. – EngrStudent Jun 29 '21 at 15:27

0 Answers