I have a dataset where the positive class is 1.7%, which equates to about 40k positive cases out of a total base of approximately 2.5m.
What is a realistic precision to achieve for the customers most likely to cancel? I am currently getting 10% precision at about 5% recall. My experience of working with low-frequency / scarce data is that I can get a precision of somewhere between 4 and 6 times the underlying rate. The current model has an AUC of 0.73.
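To make those figures concrete, here is a rough back-of-the-envelope sketch in Python. It only uses the numbers quoted above (1.7% base rate, roughly 40k positives out of 2.5m, 10% precision at 5% recall); the variable names are just for illustration and the counts are approximate.

```python
# Figures taken from the question; all values are approximate.
base_rate = 0.017          # share of customers who churn
total_customers = 2_500_000
positives = 40_000         # approximate number of churners
precision = 0.10           # precision at the current operating point
recall = 0.05              # recall at the current operating point

# Lift: how many times better than picking customers at random.
lift = precision / base_rate

# Churners caught at this operating point, and how many customers
# must be flagged to catch them at 10% precision.
true_positives = recall * positives
flagged = true_positives / precision
share_flagged = flagged / total_customers

print(f"lift over base rate: {lift:.2f}x")
print(f"churners caught:     {true_positives:.0f}")
print(f"customers flagged:   {flagged:.0f} ({share_flagged:.2%} of the base)")
```

So the current operating point is a lift of just under 6, which sits at the top of the 4 to 6 range mentioned above, and it corresponds to flagging well under 1% of the customer base.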
Obviously this depends on the data I am feeding into the model, so a little context: this is for a churn model, where customers have relatively long-term contracts and we do not have perfect information on what offers the customer has from our competition.
What we do have: which products are in use, demographics, current price points and what they have been historically for that customer, plus some information about the levels of service provided.
There is more data that will be made available. I am being asked what improvements can be expected, which is an impossible ask in my opinion. I had hoped to find some benchmarks on what is possible, but sadly I have not found much. I don't find an AUC of 0.73 so bad, but I'm not sure if it is the most effective measure for scarce predictions. As I say, I am interested in the precision resulting from the models.
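Since precision on the highest-scored customers is what I care about, the kind of measure I have in mind is precision, recall and lift on the top fraction of the ranked list. A minimal sketch in plain Python follows; the function name `precision_recall_at_top` and the toy labels and scores are made up for illustration and are not my real data.

```python
def precision_recall_at_top(y_true, scores, fraction):
    """Precision, recall and lift when flagging the top `fraction` by score."""
    n = len(y_true)
    k = max(1, int(round(n * fraction)))
    # Rank customers from highest to lowest predicted churn score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:k])
    total_positives = sum(y_true)
    precision = true_positives / k
    recall = true_positives / total_positives if total_positives else 0.0
    base_rate = total_positives / n
    lift = precision / base_rate if base_rate else 0.0
    return precision, recall, lift


# Toy example (hypothetical data): 10 customers, 2 churners.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

p, r, lift_at_top = precision_recall_at_top(y_true, scores, fraction=0.2)
print(f"precision={p:.2f} recall={r:.2f} lift={lift_at_top:.2f}")
```

Unlike AUC, which summarises the ranking over the whole population, this looks only at the slice of customers that would actually be targeted.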
The winning model right now is XGBoost with a scale_pos_weight of 12.
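For reference, the commonly quoted starting point for scale_pos_weight is the ratio of negative to positive cases. A quick sketch of what that would be at a 1.7% positive rate, using only the figures above (the variable names are illustrative):

```python
base_rate = 0.017                 # positive class share from the question
chosen_weight = 12                # scale_pos_weight of the winning model

# Common heuristic: number of negatives divided by number of positives.
heuristic_weight = (1 - base_rate) / base_rate

print(f"negatives per positive: {heuristic_weight:.1f}")
print(f"chosen weight as a share of that: {chosen_weight / heuristic_weight:.0%}")
```

So the value of 12 is well below the full class ratio, i.e. the positives are upweighted only partially.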
Appreciate your feedback, J