
If I use a model that outputs probabilities on an imbalanced dataset (say the ratio between the majority and minority class is 100:1), I see that the predicted probabilities for data points from the majority class are very high (say 99% or so), and much higher than the predicted probabilities for data points from the minority class. The problem is: in abnormality detection in banking, and in many medical studies, we only want to detect the minority class. So I want to increase the predicted probability of the minority class. What can we do in this case? I searched many sources on the internet and in papers, but did not find any solutions to this problem. Maybe this is because people in machine learning mostly care about metrics like accuracy, so they just apply under/over-sampling to improve performance.

Thank you for reading my question.

Huy Nguyen
  • You “just want to detect the minority class”? Just call everything a member of the minority class! Then you will have perfect performance! If this is not an acceptable solution, why is perfect ability to detect the cases of interest not perfect for your work? – Dave Aug 15 '21 at 12:07
  • Dave: Maybe the way I phrased the original post made it sound extreme. The main problem is that I just want to improve the predicted probability of the minority class relative to the majority class. Moreover, in practice you cannot simply predict that everything is in the minority class, since (in banking/medical/marketing) we do not have an infinite budget to deal with all cases. – Huy Nguyen Aug 15 '21 at 12:11
  • In other words, you care about both kinds of misclassifications, calling minority classes majority and calling majority classes minority? – Dave Aug 15 '21 at 12:15
  • No, it is not about accuracy anymore. If I cared about that, I would just under/over-sample and accuracy would increase. I mainly want to increase the estimated probability (in the right way) of samples from the minority class. For example, suppose a sample from the minority class has an estimated probability of 0.7; it is still correctly classified, but when I select the top samples with the highest probability for a marketing campaign, this sample is not on the list, since the estimated probabilities of samples from the majority class are much higher (say 0.95 or so). – Huy Nguyen Aug 15 '21 at 12:24
  • Harrell discusses marketing in one of his blog posts, among the links below. Why do you (seemingly) want wrong probabilities of membership? https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Aug 15 '21 at 12:30
  • See this answer on the benefit of using sensitivity and specificity instead of accuracy for small class sizes: https://stats.stackexchange.com/a/533900/318288 – Aug 15 '21 at 15:27
  • [Harrell also dislikes sensitivity and specificity.](https://stats.stackexchange.com/a/502634/247274) // You still have yet to say why you want incorrect probabilities of class membership. – Dave Aug 26 '21 at 16:01

1 Answer


Use a generative classifier that learns the likelihood of the class of interest. Perhaps start with a Naive Bayes classifier with a uniform prior.
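A minimal sketch of this idea, assuming scikit-learn's `GaussianNB` (its `priors` parameter fixes the class priors instead of estimating them from class frequencies); the toy data and numbers here are made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Imbalanced toy data: 1000 majority samples (class 0) vs 10 minority (class 1).
X = np.vstack([rng.normal(0.0, 1.0, (1000, 2)),
               rng.normal(2.0, 1.0, (10, 2))])
y = np.array([0] * 1000 + [1] * 10)

# Default fit: priors are the empirical class frequencies (~0.99 vs ~0.01),
# which pulls the minority class's posterior probability down.
nb_freq = GaussianNB().fit(X, y)

# Uniform prior: both classes weighted equally, so the posterior is driven
# only by the per-class likelihoods learned in isolation.
nb_unif = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

x_new = np.array([[1.0, 1.0]])  # a borderline point between the two classes
p_freq = nb_freq.predict_proba(x_new)[0, 1]  # minority prob., frequency prior
p_unif = nb_unif.predict_proba(x_new)[0, 1]  # minority prob., uniform prior
```

Since both models learn identical per-class Gaussians and differ only in the prior, the uniform-prior model always assigns the minority class a higher posterior probability than the frequency-prior model does.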

Jayaram Iyer
  • How does a discriminative model not predict adequately? – Dave Aug 15 '21 at 13:00
  • A discriminative model perhaps has a higher tendency to learn the relative frequency of occurrences of the target classes. The generative model learns the distribution of each class in isolation and hence a uniform prior puts them on equal footing. – Jayaram Iyer Aug 15 '21 at 13:08