I'm trying to classify instances into two classes ("good" and "bad"). My ultimate goal is to predict good instances, but I don't need to identify all of them. For example, say my test data contains 1000 good instances and 1000 bad ones; I would like to pick the 100 that are most likely to be good, with a minimal number of false positives. I could in theory build a model and choose a threshold that selects the top 10% of predicted-good instances, but what I would really like is for the algorithm to optimize toward this goal directly (rather than picking the algorithm with the best overall good-bad separation and then applying a threshold).
- Then use logistic regression, which models probabilities (i.e. risks) rather than only hard classifications. See https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044#127044 – kjetil b halvorsen Oct 31 '17 at 20:18
1 Answer
There are two ways to do this.

1. Dump the prediction probabilities and take the instances with a 99% probability of being good, then a 98% probability, and so on, until you have as many as you need.
2. Use a cost-sensitive metaclassifier (in Weka, CostSensitiveClassifier) with the classifier of your choice as the base classifier.
Instead of the standard cost matrix
0 1
1 0
use a cost matrix that emphasizes precision, e.g.
0 3
1 0
(this assumes that true positives would be in the lower right of your confusion matrix)
For example, using Weka's diabetes.arff dataset with the J48 classifier and the standard cost matrix, the confusion matrix is
=== Confusion Matrix ===
a b <-- classified as
407 93 | a = tested_negative
108 160 | b = tested_positive
but if we train with the modified cost matrix instead, we get this:
=== Confusion Matrix ===
a b <-- classified as
477 23 | a = tested_negative
190 78 | b = tested_positive
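As a rough sketch of the first approach (ranking by predicted probability and keeping only the top k), here is what it could look like in scikit-learn; the dataset, model, and k are illustrative stand-ins, not part of the original answer:

```python
# Sketch: instead of thresholding at 0.5, rank test instances by the
# model's probability of "good" and keep only the k most confident ones.
# Dataset and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive ("good") class for each test instance
proba_good = clf.predict_proba(X_test)[:, 1]

# Keep the 100 instances the model is most confident are good
k = 100
top_k_idx = np.argsort(proba_good)[::-1][:k]
precision_at_k = y_test[top_k_idx].mean()
print(f"precision among top {k}: {precision_at_k:.2f}")
```

Precision among the selected top-k is the natural metric here, since the asker only cares about false positives within the picked subset.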
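The cost-matrix mechanism above is Weka-specific, but a similar precision-favoring effect can be approximated in scikit-learn with class weights. This is only an analogue of the cost-sensitive metaclassifier, not the same algorithm, and the dataset and depth limit below are illustrative assumptions:

```python
# Sketch: scikit-learn analogue of the cost-matrix idea. Setting
# class_weight={0: 3, 1: 1} makes errors on actual-negative instances
# (false positives) three times as costly, mirroring the
# [[0, 3], [1, 0]] cost matrix in the answer above.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = DecisionTreeClassifier(max_depth=5, random_state=0)
plain.fit(X_train, y_train)
costly = DecisionTreeClassifier(max_depth=5, class_weight={0: 3, 1: 1},
                                random_state=0)
costly.fit(X_train, y_train)

cm_plain = confusion_matrix(y_test, plain.predict(X_test))
cm_costly = confusion_matrix(y_test, costly.predict(X_test))

# Entry [0, 1] counts actual-negative instances predicted positive (FP)
print("false positives, plain:", cm_plain[0, 1])
print("false positives, cost-weighted:", cm_costly[0, 1])
```

As with the Weka example, expect the trade-off to show up in the confusion matrix: fewer false positives in exchange for more missed positives.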

zbicyclist