
I have a dataset where the positive class is 1.7%, which equates to about 40k positive cases out of a total base of approx. 2.5m.
What is a realistic precision to achieve for the customers most likely to cancel? I am currently getting 10% precision at about 5% recall, which gives an AUC of 0.73. My experience of working with low-frequency / scarce data is that I can get precision of somewhere between 4 and 6 times the underlying rate.
Obviously this depends on the data I am feeding into the model, so a little context: this is a churn model, where customers have relatively long-term contracts and we do not have perfect information on what offers the customer has from our competition. What we do have: which products are in use, demographics, price points (current and historical for that customer), plus some information about the levels of service provided.

There is more data that will be made available, and I am being asked what improvements are expected, which is an impossible ask in my opinion. I had hoped to find some benchmarks on what is possible, but sadly I have not found much... I don't find an AUC of 0.73 so bad, but I'm not sure it is the most effective measure for scarce/imbalanced predictions... As I say, I am mainly interested in the precision the models achieve.

The winning model right now is xgboost with a scale_pos_weight of 12.
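(For reference, the setup is essentially the sketch below; the synthetic data is just a stand-in with roughly the right class balance, and the real model of course uses the churn features described above.)

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in data with ~1.7% positives; the real X/y are the churn features and flag.
X, y = make_classification(n_samples=100_000, n_features=20, n_informative=8,
                           weights=[0.983], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                        scale_pos_weight=12)  # up-weight the rare churners
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```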

Appreciate your feedback, J

James Oliver

1 Answer


There are two questions raised in the post: A. what is a good metric to aim for in an imbalanced problem, and B. what improvements are to be expected as more data become available. Both are rather broad questions, but I will try to be somewhat succinct.

You are absolutely correct that Precision & Recall are potentially misleading metrics in an imbalanced problem. The CV.SE thread "When is unbalanced data really a problem in Machine Learning?" is an excellent read on the matter to get one started. Regarding your particular setting: using AUC-ROC in the first instance is a good step, because it has a fixed baseline (0.50) and therefore tells us how much better we do than chance. Related metrics like Cohen's kappa and lift curves are also very helpful; I would suggest incorporating them into your analysis. (When optimising a model, AUC-PR is also potentially a good choice, but it is harder to interpret directly because its baseline varies with the class prevalence.)

A major angle when dealing with an imbalanced dataset is the cost of misclassification; this is explored within the context of cost-sensitive learning, and it is especially prominent for a customer churn model. Aside from the standard adage that "a False Positive can have a different misclassification cost than a False Negative", the customers themselves have different misclassification costs: a young supermarket customer who spends on average $20-25 per week is less valuable than an older customer who spends on average $150 on their weekly shopping. We can define cost-sensitive functions to evaluate against when using XGBoost, and in other frameworks too.
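As an illustration, here is a minimal sketch of such a cost-sensitive evaluation in Python; the synthetic data, the cost figures and the 0.5 cutoff are placeholders, not recommendations:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, cohen_kappa_score

# Synthetic stand-in for a churn dataset with a ~1.7% positive rate.
X, y = make_classification(n_samples=200_000, n_features=20, n_informative=8,
                           weights=[0.983], random_state=0)
# Hypothetical per-customer value (e.g. average weekly spend); drives the FN cost.
value = np.random.default_rng(0).gamma(shape=2.0, scale=40.0, size=len(y))

X_tr, X_te, y_tr, y_te, v_tr, v_te = train_test_split(
    X, y, value, test_size=0.25, stratify=y, random_state=0)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          scale_pos_weight=12)
model.fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

def total_misclassification_cost(y_true, y_prob, customer_value,
                                 fp_cost=5.0, threshold=0.5):
    """Total cost of acting on the predictions at a given cutoff.
    FP: fixed cost of a retention offer wasted on a non-churner (placeholder).
    FN: customer-specific value lost when a churner is missed."""
    y_hat = (y_prob >= threshold).astype(int)
    fp = (y_hat == 1) & (y_true == 0)
    fn = (y_hat == 0) & (y_true == 1)
    return fp.sum() * fp_cost + customer_value[fn].sum()

print("ROC AUC:            ", roc_auc_score(y_te, prob))
print("Cohen's kappa @ 0.5:", cohen_kappa_score(y_te, (prob >= 0.5).astype(int)))
print("Total cost @ 0.5:   ", total_misclassification_cost(y_te, prob, v_te))
```

In practice the threshold and the cost figures would be dictated by the economics of the retention campaign (cost of an offer versus the value of a saved contract), not chosen a priori.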

Improvement coming from additional data can be presented by using learning curves; they allow us to estimate how much gain (in terms of our metric of interest) we should expect when adding more training data. The 1994 paper "Learning Curves: Asymptotic Values and Rate of Convergence" by Cortes et al. is a seminal work on the matter. Figueroa et al.'s "Predicting sample size required for classification performance" (2012) offers a more modern exposition (as well as a more modern way to calculate learning curves). CV.SE has an interesting thread on the matter here: "How to know if a learning curve from SVM model suffers from bias or variance?"; it covers SVMs, but its lessons apply to any classifier.
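For example (again a sketch on synthetic data, with the estimator and the grid of training sizes as placeholders), scikit-learn's learning_curve produces such curves directly:

```python
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, StratifiedKFold

X, y = make_classification(n_samples=100_000, n_features=20, n_informative=8,
                           weights=[0.983], random_state=0)

sizes, train_scores, val_scores = learning_curve(
    xgb.XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=12),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",   # or a cost-based scorer built with make_scorer
    n_jobs=-1,
)

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="training AUC")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="cross-validated AUC")
plt.xlabel("Number of training examples")
plt.ylabel("ROC AUC")
plt.legend()
plt.show()
```

If the cross-validated curve has already flattened well before the full training set size, extra rows of the same kind of data are unlikely to move the metric much.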

So to recap:

  1. There is no single realistic precision (or recall) to aim for when dealing with an imbalanced learning task. It is far more reasonable to utilise misclassification costs and/or use a simple model as a baseline and show improvements from there on.
  2. The impact of getting more data can be explored through learning curves. If they do not show additional improvements, do not be alarmed. This is potentially a hint that the question asked might need to be changed. And on that matter, another CV.SE gem: "How to know that your machine learning problem is hopeless?".
usεr11852