Logistical regression - very few 1's in response vector "Y"

Question

I am trying to develop a model to predict retention. The problem is that retention is very rare: approx. 0.2 %. So far I have been using logistical regression, without much success. For example, among clients with a predicted probability above 70 %, I get 4 true retention clients and 157 wrongly predicted retentions.

I have spent quite a lot of time deciding what covariates to choose and what transformation to apply. I don't think I can improve that.

The question could also be formulated as follows: how should I use (or what should I use instead of) logistical regression when the response vector "Y" contains very few 1's (0.2 % in my case)?

If you need to know more about my data, tell me and I will provide the details, though I don't think it matters. In any case, I have enough data: about 1.5 million rows, so about 3000 1's.

(1) The usual term is "logistic regression". (2) What is your events per variable (EPV)? If EPV >> 10, you likely have enough data; if EPV < 20, you could consider regularization. (3) Logistic regression will provide the best model unless there are significant interactions or non-linear effects you're not modeling. (4) Prediction is hard. What makes you think your model isn't good? – charles, Oct 06 '15 at 11:47
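
For what it's worth, the EPV check from point (2) is simple arithmetic. Here is a minimal sketch using the row count and event rate from the question; the number of model parameters (`n_parameters = 30`) is a made-up placeholder, since the question does not say how many covariates are used.

```python
# Events per variable (EPV): the count of the rarer outcome class divided by
# the number of candidate model parameters.

n_rows = 1_500_000      # from the question
event_rate = 0.002      # 0.2 % from the question
n_events = round(n_rows * event_rate)  # about 3000 1's

def events_per_variable(n_events: int, n_non_events: int, n_parameters: int) -> float:
    """EPV based on the rarer of the two outcome classes."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return min(n_events, n_non_events) / n_parameters

n_parameters = 30  # hypothetical: not stated in the question
epv = events_per_variable(n_events, n_rows - n_events, n_parameters)
print(f"events = {n_events}, EPV = {epv:.1f}")
```

With 3000 events, EPV only falls below 20 once the model has more than 150 parameters, so under this assumption sample size is unlikely to be the bottleneck.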

score 1 · Answer 1 · answered Oct 06 '15 at 11:49

You have quite a high lift in the scored data, so your predictions are not necessarily "bad". What are the classification statistics (e.g. AUC) on the test data set?
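
To make the lift claim concrete, here is a rough sketch. The counts (4 and 157) and the 0.2 % base rate are from the question. The `auc` helper and the toy labels and scores are only illustrative assumptions, not the asker's data. It uses the pairwise (Mann-Whitney) definition of AUC, which is fine for a small example but too slow for 1.5 million rows.

```python
# Precision and lift in the "predicted probability above 70 %" group.
true_pos = 4        # true retention clients (from the question)
false_pos = 157     # wrongly predicted retentions (from the question)
base_rate = 0.002   # overall retention rate, 0.2 % (from the question)

precision = true_pos / (true_pos + false_pos)  # share of real 1's in the group
lift = precision / base_rate                   # how much better than picking at random

print(f"precision = {precision:.4f}, lift = {lift:.1f}")

def auc(labels, scores):
    """AUC as the probability that a random 1 outscores a random 0 (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one 1 and one 0")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, purely illustrative.
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
toy_auc = auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

So the top group is roughly 12 times richer in retained clients than the population as a whole, even though most of its members are still 0's.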

Analyst

(+1) Though a 2.5% observed retention vs over 70% predicted retention suggests some miscalibration. – Scortchi - Reinstate Monica Oct 06 '15 at 12:54
I guess you are right. I will have to put more effort into the model. – T. Rubin Oct 08 '15 at 07:08
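
One way to look at the miscalibration mentioned above is to bin the predicted probabilities and compare the mean prediction with the observed rate in each bin. A minimal sketch follows. The `calibration_table` helper, the bin edges, and the value 0.75 used for every prediction in the top group are assumptions for illustration; the question only says those predictions are "above 70 %".

```python
def calibration_table(labels, probs, edges):
    """Per bin [low, high): (low, high, n, mean predicted, observed rate).

    The last bin also includes its upper edge. Empty bins are skipped.
    """
    rows = []
    last = len(edges) - 2
    for b, (low, high) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = [
            (y, p) for y, p in zip(labels, probs)
            if low <= p < high or (b == last and p == high)
        ]
        if not in_bin:
            continue
        n = len(in_bin)
        mean_pred = sum(p for _, p in in_bin) / n
        observed = sum(y for y, _ in in_bin) / n
        rows.append((low, high, n, mean_pred, observed))
    return rows

# The top group from the question: 4 true and 157 false retentions.
labels = [1] * 4 + [0] * 157
probs = [0.75] * 161  # assumed value; only "above 70 %" is known
edges = [0.0, 0.1, 0.3, 0.7, 1.0]  # hypothetical bin edges

rows = calibration_table(labels, probs, edges)
for low, high, n, mean_pred, observed in rows:
    print(f"[{low:.1f}, {high:.1f}]  n={n}  predicted={mean_pred:.3f}  observed={observed:.3f}")
```

For a well-calibrated model the two columns would be close in every bin. Here the predicted value is above 0.7 while the observed rate is about 0.025.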
