Is this a good strategy to set a threshold on softmax probabilities in a multi-class classification task?

Question

I have a large image dataset that was classified by a ConvNet into different classes (objects). For each image, the top-1 softmax probability is given, ranging between 0 and 1. Because this is a multi-class classification task, the softmax output contains one value per class, for example (0.6, 0.1, 0.2, 0.1); the top-1 probability in this example would be 0.6. In my dataset, the top-1 softmax probability of many images is rather low (e.g. 0.1), meaning that the model assigns a low probability to the image showing the predicted class.

Now I am wondering if and how I should set a threshold on the softmax probabilities. My approach was to compare the predicted labels with ground-truth labels (which are available for ca. 10% of the whole dataset), plot a ROC curve, and calculate the Youden index and the optimum cut-off point. I then used this cut-off point as a threshold for the softmax probabilities and removed all images with a top-1 softmax probability below it. This reduced my dataset to ~1/4 of its original size.

My questions are:

- Can I use the described approach to define a threshold for the top-1 softmax probabilities?
- Are there other approaches, for example defining a threshold for each class? And how would one do this?

Shameless promotion of proper scoring rules (shameless because I’m linking my own question, though Kolassa gives a nice answer): https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email — Dave, Jun 09 '20 at 01:47
Choosing the "best" threshold implies that there is some criterion for deciding that one threshold is better than another. There is no context-free "best threshold," because in all but a few corner cases, each choice of threshold implies a different tradeoff between true positives and false positives. To be answerable, you'll need to clarify what problem you're trying to solve and how setting a threshold solves it. As it stands, you're describing a solution to an **unstated** problem. — Sycorax, Sep 13 '21 at 21:11

score 0 · Answer 1 · answered Jun 09 '20 at 01:37

I don't think I've heard of something like this being done before in the way you're describing. Can you do it? Yes. Is it a good idea? Well, I'm not so sure -- as I understand it, you're essentially using your trained model to cherry-pick your data so that your dataset only contains data points where the model reaches a certain confidence -- which means any metrics you compute on the filtered dataset are going to be biased.

If you want to threshold, an alternative would be to leave the dataset unmodified and apply the threshold at prediction time instead:

If top-1 probability >= threshold, output top-1 class as the prediction.
Otherwise, output "Don't know."
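
A minimal sketch of that rule in Python with NumPy might look like the following. The function name `predict_with_abstain`, the class names, the example probabilities, and the 0.5 threshold are all made up for illustration, and treating a probability exactly equal to the threshold as accepted is an assumption.

```python
import numpy as np

def predict_with_abstain(probs, threshold, class_names, unknown="Don't know"):
    """Return the top-1 class per row, or `unknown` if its probability is below `threshold`."""
    probs = np.asarray(probs, dtype=float)
    top1_idx = probs.argmax(axis=1)   # index of the most probable class per image
    top1_prob = probs.max(axis=1)     # its softmax probability
    # Assumption: a probability exactly at the threshold counts as accepted.
    return [class_names[i] if p >= threshold else unknown
            for i, p in zip(top1_idx, top1_prob)]

# Hypothetical softmax outputs for three images and four classes.
probs = [[0.60, 0.10, 0.20, 0.10],
         [0.30, 0.30, 0.20, 0.20],
         [0.05, 0.90, 0.03, 0.02]]
class_names = ["cat", "dog", "car", "tree"]  # illustrative names only

predictions = predict_with_abstain(probs, threshold=0.5, class_names=class_names)
print(predictions)  # ['cat', "Don't know", 'dog']
```

The dataset itself stays intact this way; images the model is unsure about are flagged rather than dropped.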

To choose a threshold -- I'm unsure how the ROC curve method works here, since a ROC curve is defined for binary labels and you'll have to binarize your labels for it to even make sense. For the multi-class case, I suppose you could treat each class as a one-versus-rest binary problem and micro- or macro-average the resulting curves over the classes in your dataset.
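
One possible reading of the micro-averaged variant, sketched with NumPy only: one-hot encode the ground-truth labels, pool every (image, class) pair into a single binary problem, and pick the cut-off that maximizes Youden's J = TPR - FPR. The helper names and the toy data are hypothetical, and the sketch assumes integer labels in `0..K-1` with at least one positive and one negative pair.

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()  # assumed to be non-zero
    best_t, best_j = None, -np.inf
    for t in np.unique(y_score):                  # every observed score is a candidate
        pred = y_score >= t
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = float(t), float(tpr - fpr)
    return best_t, best_j

def micro_average_threshold(probs, labels):
    """Pool all one-versus-rest (image, class) pairs and find one Youden cut-off."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    onehot = np.eye(probs.shape[1], dtype=bool)[labels]  # binarized labels
    return youden_threshold(onehot.ravel(), probs.ravel())

# Hypothetical softmax outputs (4 images, 3 classes) and ground-truth labels.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5],
         [0.4, 0.5, 0.1]]
labels = [0, 1, 2, 0]

threshold, j = micro_average_threshold(probs, labels)
print(threshold, j)  # 0.4 0.875
```

A macro-averaged version would instead compute one curve per class and average them, which weights rare classes more heavily than pooling does.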

As for per-class thresholds -- that might be overkill. I would try a single universal threshold first, and if that yields undesirable results (you'll have to define what that means for your case), you could try the same threshold-picking strategy with the same one-versus-all setup to derive a ROC curve, and hence a threshold, for each class.
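
If per-class thresholds do turn out to be needed, a rough sketch under the same one-versus-all idea could look like this. The function name and the toy data are hypothetical, Youden's J is used as the selection criterion only because the question mentions it, and a class with no labeled positives or no labeled negatives gets `None` because its ROC curve is undefined.

```python
import numpy as np

def per_class_thresholds(probs, labels):
    """One-versus-all Youden cut-off for each class column of `probs`."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    thresholds = []
    for k in range(probs.shape[1]):
        is_k = labels == k                       # binarize: class k versus all others
        if is_k.all() or not is_k.any():
            thresholds.append(None)              # no positives or no negatives: ROC undefined
            continue
        scores = probs[:, k]
        candidates = np.unique(scores)
        # pred[i, j] is True when image j is accepted at candidate threshold i
        pred = scores[None, :] >= candidates[:, None]
        tpr = pred[:, is_k].mean(axis=1)
        fpr = pred[:, ~is_k].mean(axis=1)
        thresholds.append(float(candidates[np.argmax(tpr - fpr)]))
    return thresholds

# Hypothetical softmax outputs (4 images, 3 classes) and ground-truth labels.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5],
         [0.4, 0.5, 0.1]]
labels = [0, 1, 2, 0]

thresholds = per_class_thresholds(probs, labels)
print(thresholds)  # [0.4, 0.8, 0.5]
```

With only a handful of labeled images per class, these cut-offs will be noisy, which is one reason to start with a single threshold.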

Thank you for this answer. Actually, the model was already trained and tested on other training and test data. Now I want to use the trained model to classify a completely new, unseen dataset, so picking only the data points with high softmax probabilities will not change the metrics previously calculated. You propose to micro- or macro-average one-versus-all binary classifiers for each class of my dataset. Could you expand a little on how you would do this? I also have some classes for which I don't have ground-truth labels. — albren, Jun 09 '20 at 09:10
