I'm trying to classify instances into two classes ("good" and "bad"). My ultimate goal is to predict good instances, but I don't need to identify all of them. For example, say my test data contains 1000 good instances and 1000 bad ones; I would like to pick the 100 that are most likely to be good, with a minimal number of false positives. I could in theory build a model and choose a threshold that selects the top 10% of predicted-good instances, but what I would really like is for the algorithm to optimize toward this goal directly (rather than picking the algorithm with the best overall good-bad separation and then applying a threshold).
- Then use logistic regression, which models probabilities (i.e. risks) rather than only hard classifications. See https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044#127044 – kjetil b halvorsen Oct 31 '17 at 20:18
1 Answer
There are two ways to do this.

1. Dump the prediction probabilities and take the instances with a 99% probability of being good, then a 98% probability, and so on, until you have as many as you need.
2. Use a cost-sensitive metaclassifier (in Weka, CostSensitiveClassifier) with the classifier of your choice as the base classifier.
Instead of the standard cost matrix
0 1
1 0
use a cost matrix that emphasizes precision, e.g.
0 3
1 0
(this assumes that true positives would be in the lower right of your confusion matrix)
For example, using Weka's diabetes.arff dataset with the J48 classifier and the standard cost matrix, the confusion matrix is
=== Confusion Matrix ===
a b <-- classified as
407 93 | a = tested_negative
108 160 | b = tested_positive
but if we train with the modified cost matrix instead, we get this:
=== Confusion Matrix ===
a b <-- classified as
477 23 | a = tested_negative
190 78 | b = tested_positive
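As a rough sketch of the first approach (ranking by predicted probability and keeping only the top k), here is what it could look like in scikit-learn; the dataset, model, and k are illustrative stand-ins, not part of the original answer:

```python
# Sketch: instead of thresholding at 0.5, rank test instances by the
# model's probability of "good" and keep only the k most confident ones.
# Dataset and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive ("good") class for each test instance
proba_good = clf.predict_proba(X_test)[:, 1]

# Keep the 100 instances the model is most confident are good
k = 100
top_k_idx = np.argsort(proba_good)[::-1][:k]
precision_at_k = y_test[top_k_idx].mean()
print(f"precision among top {k}: {precision_at_k:.2f}")
```

Precision among the selected top-k is the natural metric here, since the asker only cares about false positives within the picked subset.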
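The cost-matrix mechanism above is Weka-specific, but a similar precision-favoring effect can be approximated in scikit-learn with class weights. This is only an analogue of the cost-sensitive metaclassifier, not the same algorithm, and the dataset and depth limit below are illustrative assumptions:

```python
# Sketch: scikit-learn analogue of the cost-matrix idea. Setting
# class_weight={0: 3, 1: 1} makes errors on actual-negative instances
# (false positives) three times as costly, mirroring the
# [[0, 3], [1, 0]] cost matrix in the answer above.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = DecisionTreeClassifier(max_depth=5, random_state=0)
plain.fit(X_train, y_train)
costly = DecisionTreeClassifier(max_depth=5, class_weight={0: 3, 1: 1},
                                random_state=0)
costly.fit(X_train, y_train)

cm_plain = confusion_matrix(y_test, plain.predict(X_test))
cm_costly = confusion_matrix(y_test, costly.predict(X_test))

# Entry [0, 1] counts actual-negative instances predicted positive (FP)
print("false positives, plain:", cm_plain[0, 1])
print("false positives, cost-weighted:", cm_costly[0, 1])
```

As with the Weka example, expect the trade-off to show up in the confusion matrix: fewer false positives in exchange for more missed positives.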

zbicyclist