3

I am building a marketing model based on logistic regression. It is a customer attrition model. The event rate is very low, i.e. 0.1%, and I have more than 1000 predictors. I know there is a rule of thumb: a minimum of 10 events per predictor. Does this rule apply before dimensionality reduction (feature extraction) with PCA and information value? Should I apply it to my original 1500 variables, or only to the significant variables that remain after variable selection techniques such as stepwise regression, PCA, etc.?

generic_user
Riya

1 Answer

10

A 20:1 rule is better, or use 15:1 as a compromise. The ratio refers to the number of candidate variables, not the number that survive selection: with $m$ events you can afford about $m/15$ candidates. With 1500 candidates and a 0.1% event rate, you are in trouble. Stepwise regression won't help. Your best bet is to use the first $m/15$ principal components and regress $Y$ on them. When you can reduce dimensionality in a way that is masked to $Y$ (the reduction never looks at the outcome), you can count the number of candidate variables as the dimensionality of the reduced space.
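As a rough illustration of that recipe, here is a minimal NumPy sketch. The data are synthetic, the helper names (`component_budget`, `pca_scores`, `fit_logistic`) are made up for this example, and the 1% event rate is chosen only so the demo runs quickly. It counts the events $m$, keeps the first $m/15$ principal components computed without ever touching $Y$, and then fits a plain logistic regression on those scores.

```python
import numpy as np


def component_budget(n_events, events_per_candidate=15):
    """Number of candidate predictors affordable under an m/15-style rule."""
    return max(1, n_events // events_per_candidate)


def pca_scores(X, k):
    """Scores on the first k principal components; the outcome is never used."""
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    Z = Xc / sd
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T


def fit_logistic(S, y, n_iter=50, ridge=1e-8):
    """Newton-Raphson logistic regression with an intercept (illustrative only)."""
    A = np.column_stack([np.ones(len(S)), S])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        w = p * (1.0 - p)
        # Tiny ridge term only guards against a numerically singular Hessian.
        H = A.T @ (A * w[:, None]) + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(H, A.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


# Synthetic stand-in data (assumption: not the asker's data).
rng = np.random.default_rng(0)
n, n_predictors = 20000, 50
X = rng.normal(size=(n, n_predictors))
y = (rng.random(n) < 0.01).astype(float)

m = int(y.sum())                              # number of events
k = min(component_budget(m), n_predictors)    # components we can afford
scores = pca_scores(X, k)                     # reduction is masked to y
beta = fit_logistic(scores, y)                # intercept + k coefficients
```

Run in the other direction, the same budget shows the scale of the problem: at 15:1, 1500 candidate variables would call for 22,500 events, which at a 0.1% event rate means roughly 22.5 million customers.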

Frank Harrell