This is, for now, a purely theoretical question. I am interested in using Supervised Machine Learning to predict, for each test observation, the probability that it belongs to each of the available classes. So, if the problem allows up to $k$ classes and the matrix $X$ represents the test data, I want to assign a vector of probabilities $[p_1, p_2, \cdots, p_k]$ to each row of $X$, instead of one of the $k$ possible classes.
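To make the setup concrete, here is a minimal sketch of what I mean, assuming scikit-learn and a synthetic dataset (the names and numbers are purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with k = 4 classes, purely for illustration.
X_all, y_all = make_classification(
    n_samples=500, n_classes=4, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One probability vector [p_1, ..., p_k] per test observation.
proba = clf.predict_proba(X_test)
print(proba.shape)      # (n_test, 4)
print(proba[0].sum())   # rows sum to 1 (up to floating point)
```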
I know that while not all learning algorithms allow for the calculation of such probabilities (SVM, for example), many do - with more or less bias that needs to be accounted for via [probability calibration][1]. What confuses me, however, is the following.
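As I understand it, even a classifier that only produces decision scores, such as an SVM, can be wrapped in a post-hoc calibration step. A hedged sketch with scikit-learn's `CalibratedClassifierCV`, reusing the data from the snippet above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# LinearSVC exposes only decision scores, not probabilities;
# CalibratedClassifierCV fits a calibration map on top of them.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000),
                                    method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba_cal = calibrated.predict_proba(X_test)  # calibrated [p_1, ..., p_k]
```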
All those learners that allow estimating class-assignment probabilities do so only as a byproduct of their actual goal, which is assigning a class to each test observation. That is, the function they are optimizing is the error in class assignment, as measured by classification accuracy. In other words, the learners are maximizing the percentage of correctly assigned classes.
However, if, like me, one is interested in the class-assignment probabilities for each test observation not merely as something incidental (for example, just to measure uncertainty), but as the main goal itself, then it seems to me that optimizing for class-assignment accuracy is not ideal. After all, if the loss function only registers right/wrong class assignments, information is lost.
An extreme example would be the following. Suppose $k=4$, meaning a problem with four available classes - say classes A, B, C and D. Now consider a test observation whose true class is C, but whose predicted class probabilities are, respectively, $p_A = 0.005, p_B = 0.50, p_C = 0.49, p_D = 0.005$. While strictly speaking the class assigned in the end was B, due to $p_B$ being the greatest, it is obvious that the correct assignment was not "far" from happening.
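A quick numeric check of this example (just a sketch): under a probability-based score the near-miss is penalized far less than a confidently wrong prediction, whereas 0/1 accuracy counts both as equally wrong.

```python
import numpy as np

p = np.array([0.005, 0.50, 0.49, 0.005])  # predicted [p_A, p_B, p_C, p_D]
y = np.array([0.0, 0.0, 1.0, 0.0])        # true class C, one-hot encoded

zero_one = float(np.argmax(p) != np.argmax(y))  # 1.0 -- a plain miss
brier = np.sum((p - y) ** 2)                    # ~0.51, squared-error score
nll = -np.log(p[y == 1.0][0])                   # ~0.71, negative log-likelihood

# A confidently wrong prediction gets the same 0/1 loss but a much
# worse probability-based score:
p_bad = np.array([0.005, 0.98, 0.01, 0.005])
brier_bad = np.sum((p_bad - y) ** 2)            # ~1.94
```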
Hence my question. In Supervised Machine Learning problems where each labeled observation has a single class as its label, is it possible, and does it make sense, to have learning algorithms optimize not the right/wrong accuracy of class assignment but the class-assignment probabilities themselves? If so, what would be good metrics - for example, the average MSE of the class probabilities across predictions?
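By "average MSE" I have in mind something like the multiclass Brier score, i.e. the mean squared error between the predicted probability vectors and the one-hot true labels, with the log loss as the other obvious candidate. A sketch, assuming scikit-learn and the `proba` / `y_test` arrays from the first snippet:

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import label_binarize

Y = label_binarize(y_test, classes=np.arange(proba.shape[1]))  # one-hot truth

# Multiclass Brier score: per-observation squared error between the
# predicted probability vector and the one-hot label, averaged.
brier = np.mean(np.sum((proba - Y) ** 2, axis=1))

# Log loss (cross-entropy) on the same predictions.
ll = log_loss(y_test, proba)
```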