I am trying to develop a model to predict retention. The problem is that retention is very rare: approx. 0.2 %. So far I have been using logistic regression, without much success. For example, among clients with a predicted probability above 70 %, I get 4 true retentions and 157 wrongly predicted retentions.
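To put those counts in perspective, here is a minimal sketch of the arithmetic in Python. The variable names are only illustrative, and the base rate is the approximate 0.2 % figure, so the resulting lift is approximate as well.

```python
# Counts among clients with predicted probability above 70 %
true_positives = 4      # predicted retention, actually retained
false_positives = 157   # predicted retention, not retained

# Approximate overall retention rate (0.2 %); an assumption, not an exact value
base_rate = 0.002

# Precision: share of predicted retentions that are real retentions
precision = true_positives / (true_positives + false_positives)

# Lift: how much more frequent retention is in this group than overall
lift = precision / base_rate

print(f"precision: {precision:.2%}")
print(f"lift over base rate: {lift:.1f}x")
```

So only about 2.5 % of the clients flagged above 70 % are real retentions, although that is still roughly 12 times the overall rate.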
I have spent quite a lot of time deciding which covariates to choose and which transformations to apply, and I don't think I can improve on that.
The question could also be formulated as follows: how should I use logistic regression (or what should I use instead) when the response vector "Y" contains very few 1's (0.2 % in my case)?
If you think you need to know what data I have, tell me and I will provide that information, though I don't think it's important. In any case, I have enough data: about 1.5 million rows, so about 3,000 1's.