
A typical approach to solving a classification problem is to identify a class of candidate models and then perform model selection using a procedure such as cross-validation. Typically one selects the model with the highest accuracy, or with the highest value of some related metric that encodes problem-specific information, such as $\text{F}_\beta$.

Assuming the end goal is to produce an accurate classifier (where the definition of accuracy is, again, problem-dependent), in what situations is it better to perform model selection using a proper scoring rule rather than an improper one such as accuracy, precision, or recall? Furthermore, let's ignore issues of model complexity and assume that, a priori, we consider all the models equally likely.
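For concreteness, here is a minimal sketch of the two selection criteria I have in mind (the candidate models, the synthetic data, and the scikit-learn scorers are illustrative assumptions, not part of the problem):

```python
# Model selection by accuracy vs. by proper scoring rules (log loss, Brier).
# Everything below is illustrative: two arbitrary candidate models on a
# synthetic dataset, scored with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # scikit-learn negates log loss and Brier score so that larger is better
    log_loss = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
    brier = -cross_val_score(model, X, y, cv=5, scoring="neg_brier_score").mean()
    print(f"{name:8s}  accuracy={acc:.3f}  log_loss={log_loss:.3f}  brier={brier:.3f}")

# The model ranked best by accuracy need not be the one ranked best by the
# proper scores; the question is when (if ever) the latter ranking should win.
```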

Previously I would have said never. We know, in a formal sense, that classification is an easier problem than regression [1], [2], and we can derive tighter bounds for the former than for the latter ($*$). Furthermore, there are cases in which trying to accurately match probabilities can result in incorrect decision boundaries or overfitting. However, based on the conversation here and the community's voting pattern on such issues, I've been questioning this view.

  1. Devroye, Luc. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer, 1996. Section 6.7.
  2. Kearns, Michael J., and Robert E. Schapire. "Efficient distribution-free learning of probabilistic concepts." Proceedings of the 31st Annual Symposium on Foundations of Computer Science. IEEE, 1990.

$(*)$ This statement might be a little sloppy. I specifically mean that, given labeled data of the form $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in \mathcal{X}$ and $y_i \in \{1, \ldots, K\}$, it seems to be easier to estimate a decision boundary than to accurately estimate the conditional probabilities $P(y \mid x)$.
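To make this concrete in the binary case ($K = 2$): writing $\eta(x) = P(y = 1 \mid x)$, the Bayes classifier predicts class $1$ exactly when $\eta(x) > 1/2$, so any estimate $\hat{\eta}$ that lands on the correct side of $1/2$ at every $x$ recovers the Bayes classifier, even if $|\hat{\eta}(x) - \eta(x)|$ is large; for instance, reporting $\hat{\eta} = 0.9$ wherever $\eta = 0.51$ costs nothing in classification risk but is badly miscalibrated.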

alto

1 Answer


Think of this as a comparison between the $t$-test/Wilcoxon test and the Mood median test. The median test uses optimum classification (above or below the median of a continuous variable), and even this best-case dichotomization is only $\frac{2}{\pi}$, or roughly $\frac{2}{3}$, as efficient as the $t$-test, i.e., it discards about a third of the information in the sample. Dichotomizing at a point other than the median loses much more. Likewise, using an improper scoring rule such as the proportion classified "correctly" is at most about $\frac{2}{\pi}$ efficient. This loss of efficiency translates into selecting the wrong features and settling on a bogus model.
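A minimal simulation sketch of the $\frac{2}{\pi}$ figure (assuming Gaussian data; it compares the sample mean with the sample median, the statistic that underlies the median test):

```python
# Relative efficiency of the sample median vs. the sample mean for Gaussian
# data; the theoretical limit is 2/pi ~ 0.637.  Sample size and number of
# replications are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20_000
samples = rng.normal(size=(reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print("simulated efficiency:", var_mean / var_median)
print("theoretical 2/pi:   ", 2 / np.pi)
```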

Frank Harrell
  • I guess I don't understand why dichotomization is relevant. Ultimately the goal is to pick a classifier $h$ from some hypothesis class $H$ such that $P_{(x,y) \sim D}(h(x) \neq y)$ is minimal, given some finite sample $S$ consisting of examples distributed according to $D$. – alto Mar 25 '14 at 01:02
  • The problem is that classification (as opposed to risk prediction) is an unnecessary dichotomization. – Frank Harrell Mar 25 '14 at 02:32
  • So is it safe to assume the answer to this question is never, provided the goal is Bayes optimal decision making with respect to some utility function and not accurately matching probabilities? – alto Mar 25 '14 at 12:34
  • The Bayes optimum decision requires well-calibrated predicted risks so the two are linked. The optimum decision does not utilize a dichotomization made earlier in the pipeline but conditions on full information, e.g., $Prob(Y = 1 | X=x)$ not $Prob(Y=1 | X > c)$. – Frank Harrell Mar 25 '14 at 13:29
  • But assuming we know $D$, the Bayes optimal decision only cares if $P_D(Y=1|X=x) > c$, where $c$ comes from the utility function, correct? – alto Mar 25 '14 at 14:11
  • Yes but you are in my opinion missing a subtle point. The analyst does not possess the utility function. Only the decision maker does. So our output should normally be $Prob(Y = 1 | X=x)$. – Frank Harrell Mar 25 '14 at 15:40
  • I think I've explicitly assumed that issue away by stating I'm interested in classification. This is probably where the disagreement is coming from. We're solving two different problems. The goal of a properly posed classification problem is to produce an accurate classifier, not to provide input for a decision maker. To give short shrift to or outright dismiss the latter as a useless endeavor is essentially dismissing the entire field (or at least a big chunk) of Machine Learning. – alto Mar 25 '14 at 16:15
  • I don't yet understand the original motivation for seeking a classification vs. a prediction. Can you tell us the ultimate goal and also describe why a gray zone is of no interest? Classification is a special case of prediction so I'm not understanding why classification needs to be done by the analyst. – Frank Harrell Mar 25 '14 at 17:07
  • There are numerous real world examples of problems that require making decisions. Normally these are things that are too difficult/boring/costly for a human to do. Things like spam classification, ad-serving, image recognition, content recommendation, speech recognition, etc. In some of these cases (spam classification, content recommendation), saying "I don't know" is the same as saying no. In other cases (speech recognition), the decision is forced. Something was said, and the algorithm needs to output something. – alto Mar 25 '14 at 18:42
  • Nice discussion. In some cases, such as with some spam detectors, you can get an 'uncertain'. I am more concerned with thresholding in problems such as medical diagnosis and prognosis. – Frank Harrell Mar 25 '14 at 19:58
  • Yes, questioning one's own assumptions and laying out exactly what is being assumed can be **very** enlightening. I'm glad you pointed out diagnosis as well. That people wouldn't be handling that situation correctly is a scary thought indeed. Lastly, since you've mentioned allowing for "uncertain" several times, perhaps you'd be interested in the Knows What It Knows ([KWIK](http://www.research.rutgers.edu/~lihong/pub/Li08Knows.pdf)) framework, which explicitly tries to address this issue in an online learning setting. – alto Mar 25 '14 at 20:31
  • I think a lot of the discussion revolves around the utility/cost function that was referred to. Many of today's applications of classification come from some of the problems that @alto listed. While serving the wrong ad to a user may have a negligible cost, "serving" the wrong treatment to a patient has quite different implications. One can see a desire for classification in the former, and a desire for probabilities in the latter. While I see the convenience in some applications, there is always the choice of a third option: "unsure" (unsure spam, limited user privileges on phones). – Thomas Speidel Mar 25 '14 at 22:47
  • @FrankHarrell Have you seen [this counterexample](https://stats.stackexchange.com/a/538524/247274) to the full probability information being required for making the optimal decision? – Dave Aug 16 '21 at 16:13
  • "Accuracy" in effect assumes that a decision must be forced for all customers even when the probability of purchasing is right on the border. The optimal decision is based on the probability of purchasing no matter which accuracy score you use. Decisions must be judged on their expected utilities/losses. – Frank Harrell Aug 16 '21 at 19:48