
I now have a binary classification problem in which the positive samples outnumber the negative samples by roughly 100 to 1. In this case the usual accuracy measure (prediction == label) is not a good measure. What other measures are there? Are precision and recall for the negative class fine, or is the F1 measure best? If the model outputs probabilities, is the AUC (area under the ROC curve) a good measure?

user83176

1 Answer


Any method that uses arbitrary cutoffs and dichotomizes continuous information, such as the probability of class membership, is problematic. Classification accuracy is an improper accuracy scoring rule, being optimized by the wrong model. The concordance probability ($c$-index; ROC area) is a measure of pure discrimination. For an overall measure, consider the proper accuracy score known as the Brier score, or use a generalized likelihood-based $R^2$ measure.
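Not part of the original answer, but here is a minimal sketch of how these three measures might be computed, assuming Python with scikit-learn and NumPy; the model, the synthetic dataset, and its settings are purely illustrative, and Nagelkerke's $R^2$ is used as one common choice of likelihood-based $R^2$:

```python
# Illustrative sketch: Brier score, c-index, and a likelihood-based R^2
# on a synthetic imbalanced dataset (~99% positives, ~1% negatives).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.01, 0.99],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]           # predicted P(Y = 1)

# Brier score: mean squared error of the predicted probabilities
# (a proper accuracy scoring rule; lower is better).
brier = brier_score_loss(y_te, p)

# c-index / concordance probability = area under the ROC curve
# (pure discrimination; no threshold involved).
c_index = roc_auc_score(y_te, p)

# Nagelkerke's generalized R^2, from the log-likelihood of the model's
# held-out predictions versus an intercept-only (prevalence) model.
n = len(y_te)
eps = 1e-15
ll_model = np.sum(y_te * np.log(p + eps) + (1 - y_te) * np.log(1 - p + eps))
p0 = y_te.mean()                              # intercept-only prediction
ll_null = n * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))
r2_cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
r2_nagelkerke = r2_cox_snell / (1 - np.exp(2 * ll_null / n))

print(f"Brier score:    {brier:.4f}")
print(f"c-index (AUC):  {c_index:.4f}")
print(f"Nagelkerke R^2: {r2_nagelkerke:.4f}")
```

Note that all three quantities are computed directly from the predicted probabilities, so no threshold ever enters the evaluation.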

Frank Harrell
  • Thanks! I was wondering: if I use the AUC on the test set to represent the classifier's performance, or use the F1 score after choosing a threshold, what is the problem? As for the Brier score, if the data are imbalanced, a classifier may give a very high probability (0.99...) to the +1 class and a probability only slightly above 0.5 to the -1 class; then the score will also be very low, yet this classifier is considerably bad. – user83176 Jul 26 '15 at 13:53
  • I think AUC only measures overall performance, since the threshold still needs to be determined in the end. So a measure computed after the threshold has been determined is also needed. – user83176 Jul 26 '15 at 13:55
  • The Brier score does not use thresholds in any way. Its interpretation will vary a bit depending on the prevalence of $Y=1$. Don't choose a threshold at any rate, unless you possess the utility function; there is no need to choose a threshold in most cases. Don't use a measure that uses a threshold, as this will be very arbitrary and imprecise and often represents an improper accuracy scoring rule. – Frank Harrell Jul 26 '15 at 16:21
  • I see. In the end we use logistic regression, compute its predicted classes, and find the precision, recall, F1, and so on. I will see whether the Brier score gives a good measure for imbalanced data. – user83176 Jul 26 '15 at 17:10
  • The Brier score has been used for very imbalanced data since 1951; it is what the US Weather Service uses for judging the accuracy of rainfall forecasts. I don't think precision, recall, or F1 are proper scoring rules. You can supplement the Brier score with the $c$-index (concordance probability; ROC area), which requires no thresholding. – Frank Harrell Jul 26 '15 at 17:50
  • Thanks for your answer, but why is the Brier score useful for imbalanced data? Essentially, the Brier score is the mean squared error of the forecast, and the forecasts of very rare events should have little effect on the mean, shouldn't they? – Funkwecker Jun 06 '17 at 18:44
  • Forecasts of rare events have the "right" effect on the mean, i.e., the mean predicted probability of the event equals the overall proportion of events. The Brier score works no matter what the prevalence of events is. For a measure of pure discrimination, the $c$-index (AUROC) has an interpretation that is in fact completely free of prevalence. – Frank Harrell Jun 07 '17 at 12:26
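A quick numerical check of this last point (synthetic data, not from the thread): when the forecasts are calibrated, the mean predicted probability matches the observed event proportion even at roughly 99% prevalence, and the Brier score is computed without any threshold.

```python
# Calibrated forecasts at ~99% prevalence: the mean predicted
# probability tracks the observed event proportion.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Forecast probabilities concentrated near 1 (mean ~0.99).
p = rng.beta(30, 0.3, size=n)
y = rng.binomial(1, p)          # outcomes drawn from the forecasts

print(f"mean forecast probability: {p.mean():.5f}")
print(f"observed event proportion: {y.mean():.5f}")
# The Brier score is well-defined here, no threshold involved.
print(f"Brier score: {np.mean((p - y) ** 2):.5f}")
```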