Let’s say I have two logistic regression models trained on binary 0/1 data. The goal is to predict a continuous value: a confidence score that a given example belongs to the positive class_1 (e.g. “not spam” / “spam”).
To be clear, I don’t regard logistic regression as a classification method, at least not in the context of this question.
Both perform well in terms of accuracy and f1_score. However, I want to evaluate and compare them based on their continuous scores rather than binary accuracy; my understanding is that binary accuracy is misleading for this purpose.
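For concreteness, here is a minimal, self-contained sketch of the binary evaluation I’m doing now, on synthetic data (the data, model settings, and variable names below are all hypothetical, just to make the setup reproducible):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two models that can score the same examples very differently:
model_a = LogisticRegression(C=100.0, max_iter=1000).fit(X_tr, y_tr)  # near-unregularized
model_b = LogisticRegression(C=0.01, max_iter=1000).fit(X_tr, y_tr)   # heavily regularized

for name, model in [("A", model_a), ("B", model_b)]:
    y_hat_class = model.predict(X_te)  # hard 0/1 labels from the default 0.5 threshold
    print(name,
          "accuracy:", accuracy_score(y_te, y_hat_class),
          "f1:", f1_score(y_te, y_hat_class))
```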
Although these models do output scores, the distribution and behavior of that continuous quantity may not match the desired one.
For example, given an observation bearing some similarity to class_1, models A and B may produce scores of 0.01 and 0.4 respectively. Although both agree and correctly classify that sample as class_0, I would favor model B because its score better reflects the sample’s tendency (distance) toward class_1. Scoring it manually, I would assign that sample roughly 0.4999.
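A toy snippet of that A-vs-B case (the numbers are taken straight from the example above):

```python
# Both scores fall below the 0.5 threshold, so both models "agree" on class_0,
# yet only model B conveys the sample's proximity to class_1.
score_a, score_b = 0.01, 0.4
desired = 0.4999  # the score I would assign to this sample by hand

for name, score in [("A", score_a), ("B", score_b)]:
    hard_label = int(score >= 0.5)  # both yield 0 -> identical accuracy
    gap = abs(desired - score)      # but the score-level error differs a lot
    print(name, "class:", hard_label, "score gap:", round(gap, 4))
```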
What I’m looking for is a loss/metric that measures distance to the positive class_1,

| y_label_prob - y_hat_prob | -> huge loss

rather than just misclassification,

| y_label_class - y_hat_class | -> no loss
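As a sketch, this is how I picture those two losses, assuming hypothetical continuous labels `y_label_prob` which, as noted next, I don’t actually have:

```python
import numpy as np

def score_loss(y_label_prob, y_hat_prob):
    # Distance between desired and predicted scores:
    # heavily penalizes model A's 0.01 against a desired 0.4999.
    return np.abs(np.asarray(y_label_prob) - np.asarray(y_hat_prob))

def class_loss(y_label_class, y_hat_prob, threshold=0.5):
    # Plain misclassification: blind to how far the score is from the label.
    y_hat_class = (np.asarray(y_hat_prob) >= threshold).astype(int)
    return np.abs(np.asarray(y_label_class) - y_hat_class)

print(score_loss(0.4999, 0.01))  # 0.4899 -> huge loss
print(class_loss(0, 0.01))       # 0      -> no loss
```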
Unfortunately, I don’t have continuous labels (y_label_prob) to treat this as a pure regression problem; if I had them, I could just compute squared errors. Instead, I trained multiple binary classifiers and used the mean of their scores as continuous labels.
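Roughly, that workaround looks like the following (the specific classifiers are hypothetical placeholders; the point is the averaged scores used as pseudo-labels):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# A few diverse classifiers whose mean score serves as a continuous pseudo-label.
ensemble = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]
scores = [clf.fit(X, y).predict_proba(X)[:, 1] for clf in ensemble]
y_label_prob = np.mean(scores, axis=0)  # continuous pseudo-labels

# With these pseudo-labels, squared error against a candidate model's scores:
candidate = LogisticRegression(max_iter=1000).fit(X, y)
mse = np.mean((y_label_prob - candidate.predict_proba(X)[:, 1]) ** 2)
print(mse)
```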
What would you recommend for evaluating performance with respect to this continuous quantity when only 0/1 labels are available?