Why does binary classification on a highly imbalanced dataset output low probabilities?

Question

I'm working on a fraud detection project that uses an imbalanced dataset (roughly 9:1). I'm using logistic regression, XGBoost, and LightGBM for binary classification.

When I predict on my test set, I see that all of the predicted probabilities for the target class are very low. Is something wrong with my models, or is this natural in the imbalanced case?

Dave · Accepted Answer · 2022-01-31T14:19:46.980


Fraud is an uncommon event. Even when some signs of fraud are there, the odds are that the transaction was legitimate. Unless something is screaming at you, the probability of fraud would be low. Consequently, you might not demand a high probability to flag a case as potential fraud.

The class imbalance reflects the fact that fraud is uncommon. You can see how this impacts your predicted probability if you write out Bayes’ theorem with your predicted probability as the posterior probability of fraud, given the data, and the class ratio as the prior.

$$ P(Fraud\vert Features)=\dfrac{ P(Features\vert Fraud)P(Fraud) }{ P(Features) } $$

$P(Fraud\vert Features)$ is called the "posterior probability" of fraud. $P(Fraud)$ is called the "prior probability" of fraud and is equal to the proportion of cases that are fraud.

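When $P(Fraud)$ in the numerator is low, the posterior on the left-hand side will tend to be low too, unless the features are much more likely under fraud than they are overall (that is, unless $P(Features\vert Fraud)/P(Features)$ is large).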
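To make the effect of the prior concrete, here is a minimal sketch of the two-class version of the formula above. The likelihood values (0.6, 0.2, 0.9, 0.01) are invented for illustration and do not come from the question's data; the only number taken from the question is the roughly 9:1 class ratio, used here as a prior of 0.1. The function name `posterior_fraud` is likewise hypothetical.

```python
def posterior_fraud(prior, lik_fraud, lik_legit):
    """P(Fraud | Features) from Bayes' theorem with two classes.

    prior      -- P(Fraud), the share of fraud cases
    lik_fraud  -- P(Features | Fraud)      (illustrative value)
    lik_legit  -- P(Features | Legitimate) (illustrative value)
    """
    # P(Features), expanded over the two classes
    evidence = lik_fraud * prior + lik_legit * (1.0 - prior)
    return lik_fraud * prior / evidence


# Same evidence, different priors: features 3x more likely under fraud.
imbalanced = posterior_fraud(prior=0.1, lik_fraud=0.6, lik_legit=0.2)
balanced = posterior_fraud(prior=0.5, lik_fraud=0.6, lik_legit=0.2)

# Low prior, but evidence that "screams" fraud (90x more likely under fraud).
strong = posterior_fraud(prior=0.1, lik_fraud=0.9, lik_legit=0.01)

print(f"prior 0.1, moderate evidence: {imbalanced:.3f}")  # 0.250
print(f"prior 0.5, moderate evidence: {balanced:.3f}")    # 0.750
print(f"prior 0.1, strong evidence:   {strong:.3f}")      # 0.909
```

With identical evidence, the posterior drops from 0.75 to 0.25 purely because the prior dropped from 0.5 to 0.1. A high posterior is still reachable under the low prior, but only when the evidence is strong.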

edited Jan 31 '22 at 14:19

answered Jan 31 '22 at 12:48


First, thanks for answering! I thought the probability meant how confident the model is in its prediction. – timesToLearn Jan 31 '22 at 12:53
@timesToLearn What do you mean by that? – Dave Jan 31 '22 at 12:54
If a binary classification model outputs a probability of 0.1, I thought it meant "a 10% chance that fraud will occur". So I wonder: is just 10% okay? Can it be reliable? P.S. I'm not good at English, so sorry if that was confusing. – timesToLearn Jan 31 '22 at 13:00
@timesToLearn A $10\%$ chance of fraud means exactly that: it probably isn’t fraud, but there’s some chance that it is fraud. Think of it in terms of a weather forecast. If there is a $10\%$ chance of rain, it probably won’t rain, but it might. (In fact, there should be rain about $10\%$ of the time the weather forecast claims a $10\%$ chance of rain!) It would be nice to be able to say definitively that the case either absolutely is or absolutely is not fraud (or rain), but that might not be a realistic goal. – Dave Jan 31 '22 at 13:25
So you mean that the probability means "a p% chance of being fraud", and a high p can occur in balanced cases. Did I understand correctly? – timesToLearn Jan 31 '22 at 13:46
Yes, but a high $p$ could happen for your case, too, just that you need to overcome the low prior probability, $P(Fraud)$. – Dave Jan 31 '22 at 14:18
I really appreciate your answer! So, do many people use sampling methods to solve these problems? – timesToLearn Jan 31 '22 at 14:39
Sampling method meaning something like upsampling the minority class, downsampling the majority class, or SMOTE to invent new observations? – Dave Jan 31 '22 at 14:40
Oh, yes. Aren't they good for overcoming the low prior probability? – timesToLearn Jan 31 '22 at 14:47
I, along with [other members](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) of the community, would argue that there is nothing to overcome. If the right answer is that there is a $10\%$ chance of fraud, then that is the right answer. You wouldn't want to trick yourself into thinking there is less chance of fraud or more chance of fraud than the correct $10\%$. // "Why do people do those?" you ask. I like the comment by Sycorax in the comments of the linked question. – Dave Jan 31 '22 at 14:54
Thanks again. It helped a lot. I will read the link you shared :) – timesToLearn Jan 31 '22 at 15:04
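
The weather-forecast point from the comments (events given a 10% chance should happen about 10% of the time) can be checked with a small simulation. This is only a sketch on synthetic data: the Beta(1, 9) distribution of predicted probabilities, the sample size, the seed, and the helper name `calibration_table` are all assumptions chosen for illustration, and the "model" here is perfectly calibrated by construction.

```python
import random


def calibration_table(predicted, outcomes, n_bins=10):
    """Group predictions into equal-width bins and compare, per bin,
    the mean predicted probability with the observed event rate.

    Returns a list of (count, mean_predicted, observed_rate) tuples,
    one per non-empty bin, ordered from low to high probability.
    """
    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(predicted, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        prob_sums[b] += p
        hits[b] += y
    return [
        (counts[b], prob_sums[b] / counts[b], hits[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "predicted probabilities" with mean 0.1, mimicking a 9:1 imbalance.
predicted = [rng.betavariate(1, 9) for _ in range(100_000)]
# Each outcome is fraud (1) with exactly the predicted probability.
outcomes = [1 if rng.random() < p else 0 for p in predicted]

rows = calibration_table(predicted, outcomes)
for count, mean_p, rate in rows:
    print(f"n={count:6d}  mean predicted={mean_p:.3f}  observed rate={rate:.3f}")
```

Most predictions land in the lowest bins, yet within each well-populated bin the observed fraud rate tracks the mean predicted probability. Low probabilities on an imbalanced problem are therefore not, by themselves, a sign that a model is wrong.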
