
I'm sorry if this is a silly question.

Suppose there are two logistic regression models $M_1$ and $M_2$ trained on the same (or a similar) dataset, and their outputs for a given input $x$ are $P_{M_1}(y \mid x)$ and $P_{M_2}(y \mid x)$, respectively.

What I feel confused about is: for a given input $x$, can I determine which model's prediction is more reliable by comparing these output probabilities? For example, if $P_{M_1}(y=k \mid x) > P_{M_2}(y=k \mid x)$, is the output of $M_1$ more reliable? If not, how should I compare the two models and decide which one to choose?

Ze-Nan Li
  • This sounds appealing at first. After all, if one model predicts $0.6$ and another $0.9$, you’d trust the model making the confident $0.9$ prediction, right? The trouble is that the right answer might be that your case is genuinely ambiguous, and about $60\%$ of such cases will go to one class and $40\%$ to the other. You might be interested in evaluating model calibration and performance on strictly proper scoring rules, such as log loss and Brier score. – Dave Dec 01 '21 at 02:27
  • @Dave, thank you, I completely agree with your comment. My empirical results show that using the maximum of the two probabilities as the final output is not a good choice, even though both models are calibrated. – Ze-Nan Li Dec 01 '21 at 02:59
  • @Dave, in my humble opinion, from the viewpoint of a frequentist, $P_{M_1}(y \mid x)$ and $P_{M_2}(y \mid x)$ are two estimates of the ground-truth $P(y \mid x)$, and in this sense comparing these two probabilities does not make any sense. However, from the viewpoint of a Bayesian, $P_{M_1}(y \mid x)$ and $P_{M_2}(y \mid x)$ indicate the uncertainty of their predictions, and thus can be compared. So I am not sure whether I misunderstand the output probability..? – Ze-Nan Li Dec 01 '21 at 03:09

2 Answers


No, it's not a silly question. There are not many statistics for comparing statistics. For example, suppose you had many t-tests (or logistic models) and wanted to perform a hypothesis test to determine which t-test is the most significant; that is, a hypothesis test on hypothesis test results.

For logistic regression, compare the models using a variety of results for each: the coefficients, the overall chi-squared p-value, the Hosmer-Lemeshow statistic and table, and the deviance goodness-of-fit. For machine learning purposes, there are the ROC-AUC, sensitivity, and specificity for each model, as well as PV+ and PV- (predictive value positive and negative, which are tied to prevalence, i.e. the proportion of outcomes equal to one).
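As a rough illustration (not part of the original answer), here is a minimal R sketch of pulling several of those diagnostics for two candidate logistic models. The data frame dat and the predictors x1, x2, x3 are hypothetical stand-ins, and the Hosmer-Lemeshow test assumes the ResourceSelection package is installed.

library(ResourceSelection)  # for hoslem.test()

m1 <- glm(y ~ x1 + x2,      data = dat, family = binomial)
m2 <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

summary(m1)$coefficients  # Wald tests for each model's coefficients
summary(m2)$coefficients

# Overall chi-squared test: null deviance minus residual deviance
lr1 <- m1$null.deviance - m1$deviance
pchisq(lr1, df = m1$df.null - m1$df.residual, lower.tail = FALSE)

# Residual deviance as a goodness-of-fit summary
c(m1$deviance, m2$deviance)

# Hosmer-Lemeshow test on the fitted probabilities
hoslem.test(m1$y, fitted(m1), g = 10)
hoslem.test(m2$y, fitted(m2), g = 10)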

Things get complicated, however, because there can be differences in the input features (the predictors used for each model) and in the cross-validation methods used for each.

But overall, the AUC-ROC would be a good start. This is the receiver operating characteristic curve: the area under a plot of sensitivity vs. 1-specificity. People who present ML classification results at meetings/conferences, e.g. with many biological markers as predictors of class outcomes, will simply go through several slides entitled "AUC" or "AUC-ROC," listing how the AUC changes with different combinations of features. The AUC-ROC incorporates both sensitivity and specificity, which is much more informative than recall or classification accuracy alone when comparing your $M_1$ and $M_2$.
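For instance, a minimal sketch (again using the hypothetical dat, m1, and m2 from above) of computing and comparing the two AUCs, assuming the pROC package:

library(pROC)  # for roc(), auc(), roc.test()

roc1 <- roc(dat$y, fitted(m1))
roc2 <- roc(dat$y, fitted(m2))

auc(roc1)  # area under each ROC curve
auc(roc2)

plot(roc1)                # sensitivity vs. 1 - specificity
lines(roc2, col = "red")

roc.test(roc1, roc2)  # DeLong's test comparing the two (paired) AUCs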

In fact, if you present results based on AUC for different combinations of input features, you only need to mention which classifier was used, because the AUC can be calculated for any classifier. Thus, you could have one slide of AUCs for various mixtures of features based on multiple classifiers, where using the AUC from multiple classifiers for a specific set of features is called "ensemble classifier fusion."

The point in mentioning the above is that an experienced ML analyst would quickly move away from what you are asking and launch into a lot of other things (like ensemble methods, each of which uses CV and multiple classifiers) without getting tripped up looking for statistical tests to prove which AUC is the best. At that point, however, you have to look at overfitting, the bias/variance dilemma, and the effect of the "curse of dimensionality" of each feature set on each classifier.

  • Thanks for the nice answer, but I think these statistics actually measure model performance (statistically) on the whole dataset, and maybe cannot work on an individual input? I would also like to know whether I can compare the two models on their predicted probabilities alone. – Ze-Nan Li Dec 01 '21 at 03:17
  • I guess this is related to the actual meaning of the output probability. If this probability indicates the uncertainty of the prediction, then it can be compared, right? – Ze-Nan Li Dec 01 '21 at 03:18
  • I am sorry, maybe this question is a bit misleading. – Ze-Nan Li Dec 01 '21 at 03:20
  • Don't conflate the current universe of hypothesis tests for logistic regression with a contrived test that you want. Maybe look at meta-analysis for multiple logistic regression models. Also, if you want to test whether two probabilities are significantly different, then apply a test for the equality of two proportions, p1 and p2, but don't bring logistic regression into the issue (a small sketch of this follows these comments). You'll need to know the standard errors of p1 and p2 as well, which are tied to the sample sizes for each. But looking at your question again, it would be the AUC that changes as different predictors are used. –  Dec 03 '21 at 01:02
  • I don't think you picked up on the idea that the AUC is much more informative than a single prediction probability, since it is the area under the curve of a plot of sensitivity vs. 1-specificity. Also, models are compared on performance and goodness-of-fit statistics (Pearson, deviance, Hosmer-Lemeshow for logistic regression). The AUC is also the litmus test for how a model's predictive value changes with different inputs $(X_1, X_2, \ldots, X_p)$. –  Dec 03 '21 at 01:06
  • OK, I agree with you, and I will try the AUROC. Thank you for the clarification. – Ze-Nan Li Dec 03 '21 at 05:00
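Regarding the comment above about testing the equality of two proportions: a minimal sketch in base R, where the success counts and sample sizes are made-up numbers standing in for the two estimated probabilities and the data behind them:

# Two-sample test for equality of proportions (base R, stats package)
successes <- c(30, 45)    # hypothetical counts of positive outcomes
trials    <- c(100, 100)  # hypothetical sample sizes
prop.test(successes, trials)  # tests H0: p1 == p2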

This sounds appealing at first. After all, if one model predicts $0.6$ and another $0.9$, you’d trust the model making the confident $0.9$ prediction and not the wishy-washy $0.6$ prediction, right? The trouble is that the right answer might be that your case is genuinely ambiguous, and about $60\%$ of such cases will go to one class and $40\%$ to the other. You might be interested in evaluating model calibration and performance on strictly proper scoring rules, such as log loss and Brier score, which seek out the correct probability predictions.

Following an answer by Stéphane Laurent, let's do a simulation in R to see what it means to predict the correct (unobservable) probability.

set.seed(2021)
N <- 1000
x <- runif(N, -2, 2)
z <- x
pr <- 1/(1 + exp(-z)) # This is the true probability!
y <- rbinom(N, 1, pr)
L <- glm(y ~ x, family = binomial)

In this simulation, the y variable mimics the discrete categories that we would observe. However, there are y values of $0$ that correspond to $P(0)<0.5$, just by the luck of the draw. That is, they turned out to be $0$, even though, given x, the result is more likely to be $1$ than $0$. Therefore, we want our model to predict probability values that come close to pr. In expected value, strictly proper scoring rules like log loss and Brier score are minimized by predicting pr.
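Continuing the simulation, a quick sketch of how the Brier score and log loss could be computed for the model's fitted probabilities and, for comparison, for an "oracle" that predicts the true pr:

p_hat <- predict(L, type = "response")  # fitted probabilities from the model

# Brier score: mean squared difference between predicted probability and outcome
mean((p_hat - y)^2)  # model
mean((pr - y)^2)     # using the true probabilities

# Log loss: negative average log-likelihood of the observed outcomes
-mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))  # model
-mean(y * log(pr) + (1 - y) * log(1 - pr))        # using the true probabilities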

Dave