Let’s say I have two logistic regression models trained on binary 0/1 data. The goal is to predict a continuous value: a confidence score that a given example belongs to the positive class_1 (e.g. “not spam” / “spam”).
To be clear, I don’t regard logistic regression as a classification method, at least not in the context of this question.
Both perform well in terms of accuracy and f1_score. However, I want to evaluate and compare them based on their continuous scores rather than binary accuracy; my understanding is that binary accuracy is misleading for this purpose.
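For concreteness, here is a minimal, self-contained sketch of the binary evaluation I’m doing now, on synthetic data (the data, model settings, and variable names below are all hypothetical, just to make the setup reproducible):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two models that can score the same examples very differently:
model_a = LogisticRegression(C=100.0, max_iter=1000).fit(X_tr, y_tr)  # near-unregularized
model_b = LogisticRegression(C=0.01, max_iter=1000).fit(X_tr, y_tr)   # heavily regularized

for name, model in [("A", model_a), ("B", model_b)]:
    y_hat_class = model.predict(X_te)  # hard 0/1 labels from the default 0.5 threshold
    print(name,
          "accuracy:", accuracy_score(y_te, y_hat_class),
          "f1:", f1_score(y_te, y_hat_class))
```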
Although these models do output scores, the distribution and behavior of that continuous quantity may not match the desired one.
For example, given an observation bearing some similarity to class_1, models A and B may produce scores of 0.01 and 0.4 respectively. Although both agree and correctly classify that sample as class_0, I would favor model B because its score better reflects the sample’s tendency (distance) toward class_1. Scoring it manually, I would assign that sample roughly 0.4999.
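A toy snippet of that A-vs-B case (the numbers are taken straight from the example above):

```python
# Both scores fall below the 0.5 threshold, so both models "agree" on class_0,
# yet only model B conveys the sample's proximity to class_1.
score_a, score_b = 0.01, 0.4
desired = 0.4999  # the score I would assign to this sample by hand

for name, score in [("A", score_a), ("B", score_b)]:
    hard_label = int(score >= 0.5)  # both yield 0 -> identical accuracy
    gap = abs(desired - score)      # but the score-level error differs a lot
    print(name, "class:", hard_label, "score gap:", round(gap, 4))
```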
What I’m looking for is a loss/metric that measures distance to the positive class_1,

| y_label_prob - y_hat_prob | -> huge loss

rather than just misclassification,

| y_label_class - y_hat_class | -> no loss
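As a sketch, this is how I picture those two losses, assuming hypothetical continuous labels `y_label_prob` which, as noted next, I don’t actually have:

```python
import numpy as np

def score_loss(y_label_prob, y_hat_prob):
    # Distance between desired and predicted scores:
    # heavily penalizes model A's 0.01 against a desired 0.4999.
    return np.abs(np.asarray(y_label_prob) - np.asarray(y_hat_prob))

def class_loss(y_label_class, y_hat_prob, threshold=0.5):
    # Plain misclassification: blind to how far the score is from the label.
    y_hat_class = (np.asarray(y_hat_prob) >= threshold).astype(int)
    return np.abs(np.asarray(y_label_class) - y_hat_class)

print(score_loss(0.4999, 0.01))  # 0.4899 -> huge loss
print(class_loss(0, 0.01))       # 0      -> no loss
```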
Unfortunately, I don’t have continuous labels (y_label_prob) to treat this as a pure regression problem; if I had them, I could just compute squared errors. Instead, I trained multiple binary classifiers and used the mean of their scores as continuous labels.
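Roughly, that workaround looks like the following (the specific classifiers are hypothetical placeholders; the point is the averaged scores used as pseudo-labels):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# A few diverse classifiers whose mean score serves as a continuous pseudo-label.
ensemble = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]
scores = [clf.fit(X, y).predict_proba(X)[:, 1] for clf in ensemble]
y_label_prob = np.mean(scores, axis=0)  # continuous pseudo-labels

# With these pseudo-labels, squared error against a candidate model's scores:
candidate = LogisticRegression(max_iter=1000).fit(X, y)
mse = np.mean((y_label_prob - candidate.predict_proba(X)[:, 1]) ** 2)
print(mse)
```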
What would you recommend for evaluating performance with respect to this continuous quantity when only 0/1 labels are available?