Possible Duplicate:
Optimising for Precision-Recall curves under class imbalance
I built a classification model and tested it against a validation data set. The positive set consists of 86 cases and the negative set consists of 1256 cases. The confusion matrix is as follows:
                  Actually positive   Actually negative   Precision
Predict positive        55                  338            13.99%
Predict negative        31                  918            96.73%  (negative predictive value)
Recall                63.95%              73.09%  (specificity)
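For reference, the percentages in the table can be reproduced directly from the raw counts (a minimal sketch; the variable names are mine, not from any particular library):

```python
# Confusion-matrix counts from the table above
tp, fp = 55, 338   # predicted positive: actually positive / actually negative
fn, tn = 31, 918   # predicted negative: actually positive / actually negative

precision   = tp / (tp + fp)   # 55/393   -> 13.99%
npv         = tn / (tn + fn)   # 918/949  -> 96.73% (negative predictive value)
recall      = tp / (tp + fn)   # 55/86    -> 63.95% (sensitivity)
specificity = tn / (tn + fp)   # 918/1256 -> 73.09%

print(f"precision={precision:.2%}, NPV={npv:.2%}, "
      f"recall={recall:.2%}, specificity={specificity:.2%}")
```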
The precision and recall for this classifier are not good, especially the precision on the positive class. However, the negative cases greatly outnumber the positive cases. I am not quite sure whether, for this kind of imbalanced data, we can still use precision and recall as performance measures in the usual way.