
I want to understand how I can evaluate a (multiclass) classifier's performance when the classifier outputs a score rather than a predicted label.

Imagine a classifier that tries to predict house value based on some features. It outputs a score in [0, 1], where 0 means it predicts 'low' value and 1 means 'high' value.

I also have a test set where each example has one of three class labels: ['affordable', 'expensive', 'very expensive']. After running the test set through the classifier, I get output like this:

example      label               score
 A         'affordable'           0.23
 B         'very expensive'       0.56
 C         'affordable'           0.54
 D         'expensive'            0.80
 ... 

You can see that the model makes a mistake on D: it gives D a higher score (0.80) than B (0.56), even though D's true label ('expensive') indicates a lower value than B's ('very expensive').

My question is: how do I assess the performance of this classifier?

I understand that if I am somehow able to map the scores to the categories (i.e., find the decision boundaries), then I can perform the usual confusion matrix / ROC analysis. But how can I map the scores?

Thanks in advance!

PS: assume you don't know the implementation details of said classifier/regressor.

Wei

1 Answer


You need to choose a metric you care about. If that metric takes confidence scores as input (for example, ROC AUC computed directly from the scores), you can use the scores as they are. If the metric requires hard labels (for example, accuracy or F1 from a confusion matrix), then, as you said, you need to define a mapping from scores to predicted labels. There are many possible ways to do this.
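For instance, if you treat the three labels as ordered, one score-based check is the ROC AUC at each ordinal cut-point. A minimal sketch, assuming scikit-learn is available; the toy arrays just mirror the four examples in the question and are illustrative only:

```python
# Sketch: evaluate the raw scores directly for ordered labels by binarizing
# the true labels at each ordinal cut-point and computing ROC AUC, i.e. how
# well the score ranks the lower group below the higher group.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array(['affordable', 'very expensive', 'affordable', 'expensive'])
scores = np.array([0.23, 0.56, 0.54, 0.80])

order = {'affordable': 0, 'expensive': 1, 'very expensive': 2}
y = np.array([order[l] for l in labels])

# AUC for "at least 'expensive'" and for "at least 'very expensive'"
for cut in (1, 2):
    y_bin = (y >= cut).astype(int)
    print(f"cut >= {cut}: AUC = {roc_auc_score(y_bin, scores):.3f}")
```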

One way is to optimize/learn the mapping on a validation set. Since the three labels are ordered, you can pick two thresholds t1 < t2 that split the score range into the three classes, choosing the thresholds that optimize some metric such as macro F1 or accuracy on the validation set; see the sketch below.
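A minimal sketch of that idea, assuming scikit-learn is available; the variable names (val_scores, val_labels) and the toy data are illustrative placeholders for a real validation set:

```python
# Sketch: grid-search two cut-points that split the [0, 1] score range into
# the three ordered classes, keeping the pair that maximizes macro-F1 on a
# validation set. With the mapping fixed, the usual confusion-matrix analysis
# applies to a held-out test set.
import itertools
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

classes = ['affordable', 'expensive', 'very expensive']

def scores_to_labels(scores, t1, t2):
    """Map each score to a class using two thresholds t1 < t2."""
    idx = np.digitize(scores, [t1, t2])   # 0, 1 or 2
    return np.array(classes)[idx]

# Toy validation data -- replace with your own.
val_scores = np.array([0.10, 0.23, 0.54, 0.56, 0.80, 0.95])
val_labels = np.array(['affordable', 'affordable', 'expensive',
                       'expensive', 'very expensive', 'very expensive'])

grid = np.linspace(0.05, 0.95, 19)
t1, t2 = max(
    itertools.combinations(grid, 2),
    key=lambda ts: f1_score(val_labels, scores_to_labels(val_scores, *ts),
                            average='macro'),
)
print("chosen thresholds:", t1, t2)

# Confusion matrix for the chosen mapping (re-using the toy data here).
pred = scores_to_labels(val_scores, t1, t2)
print(confusion_matrix(val_labels, pred, labels=classes))
```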

See this question for more on choosing thresholds.

user3494047