190

This is a general question that has been asked indirectly multiple times here, but it lacks a single authoritative answer. It would be great to have a detailed answer to this for reference.

Accuracy, the proportion of correct classifications among all classifications, is a very simple and very "intuitive" measure, yet it may be a poor measure for imbalanced data. Why does our intuition misguide us here, and are there any other problems with this measure?

Sycorax
Tim

10 Answers

187

Most of the other answers focus on the example of unbalanced classes. Yes, this is important. However, I argue that accuracy is problematic even with balanced classes.

Frank Harrell has written about this on his blog: "Classification vs. Prediction" and "Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules".

Essentially, his argument is that the statistical component of your exercise ends when you output a probability for each class of your new sample. Mapping these predicted probabilities $(\hat{p}, 1-\hat{p})$ to a 0-1 classification, by choosing a threshold beyond which you classify a new observation as 1 vs. 0, is not part of the statistics any more. It is part of the decision component. And here, you need the probabilistic output of your model - but also considerations like:

  • What are the consequences of deciding to treat a new observation as class 1 vs. 0? Do I then send out a cheap marketing mail to all 1s? Or do I apply an invasive cancer treatment with big side effects?
  • What are the consequences of treating a "true" 0 as 1, and vice versa? Will I tick off a customer? Subject someone to unnecessary medical treatment?
  • Are my "classes" truly discrete? Or is there actually a continuum (e.g., blood pressure), where clinical thresholds are in reality just cognitive shortcuts? If so, how far beyond a threshold is the case I'm "classifying" right now?
  • Or does a low-but-positive probability to be class 1 actually mean "get more data", "run another test"?

Depending on the consequences of your decision, you will use a different threshold to make the decision. If the action is invasive surgery, you will require a much higher probability for your classification of the patient as suffering from something than if the action is to recommend two aspirin. Or you might even have three different decisions although there are only two classes (sick vs. healthy): "go home and don't worry" vs. "run another test because the one we have is inconclusive" vs. "operate immediately".

The correct way of assessing predicted probabilities $(\hat{p}, 1-\hat{p})$ is not to compare them to a threshold, map them to $(0,1)$ based on the threshold and then assess the transformed $(0,1)$ classification. Instead, one should use proper scoring rules. These are loss functions that map predicted probabilities and corresponding observed outcomes to loss values, which are minimized in expectation by the true probabilities $(p,1-p)$. The idea is that we take the average of the scoring rule evaluated on multiple (best: many) observed outcomes and the corresponding predicted class membership probabilities, as an estimate of the expectation of the scoring rule.

Note that "proper" here has a precisely defined meaning - there are improper scoring rules as well as proper scoring rules and finally strictly proper scoring rules. Scoring rules as such are loss functions of predictive densities and outcomes. Proper scoring rules are scoring rules that are minimized in expectation if the predictive density is the true density. Strictly proper scoring rules are scoring rules that are only minimized in expectation if the predictive density is the true density.

As Frank Harrell notes, accuracy is an improper scoring rule. (More precisely, accuracy is not even a scoring rule at all: see my answer to "Is accuracy an improper scoring rule in a binary classification setting?") This can be seen, e.g., if we have no predictors at all and just a flip of an unfair coin with probabilities $(0.6,0.4)$. Accuracy is maximized if we classify everything as the first class and completely ignore the 40% probability that any outcome might be in the second class. (Here we see that accuracy is problematic even for balanced classes.) Proper scoring rules will prefer a $(0.6,0.4)$ prediction to the $(1,0)$ one in expectation. In particular, accuracy is discontinuous in the threshold: moving the threshold a tiny little bit may make one (or multiple) predictions change classes and change the entire accuracy by a discrete amount. This makes little sense.
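To make the coin example concrete, here is a minimal sketch in Python (my own illustration; the simulation size is arbitrary). Thresholding at 0.5 turns both the degenerate $(1,0)$ forecast and the honest $(0.6,0.4)$ forecast into "always predict the first class", so their accuracies are identical, while the Brier score (a strictly proper scoring rule) clearly prefers the honest forecast:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.6, size=100_000)   # 1 = "first class", true probability 0.6

    p_certain = np.ones(y.size)              # the (1, 0) forecast: class 1 with certainty
    p_honest = np.full(y.size, 0.6)          # the honest (0.6, 0.4) forecast

    def accuracy(p, y, threshold=0.5):
        return np.mean((p >= threshold) == y)

    def brier(p, y):
        return np.mean((p - y) ** 2)         # mean squared difference between forecast and outcome

    print(accuracy(p_certain, y), accuracy(p_honest, y))   # both ~0.6: accuracy cannot tell them apart
    print(brier(p_certain, y), brier(p_honest, y))         # ~0.40 vs ~0.24: the proper score prefers (0.6, 0.4)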

More information can be found at Frank's two blog posts linked to above, as well as in Chapter 10 of Frank Harrell's Regression Modeling Strategies.

(This is shamelessly cribbed from an earlier answer of mine.)


EDIT: My answer to "Example when using accuracy as an outcome measure will lead to a wrong conclusion" gives a hopefully illustrative example where maximizing accuracy can lead to wrong decisions even for balanced classes.

Stephan Kolassa
  • 1
    Agree (+1), but not all algorithms return probabilities, and even if they do, they are not always well calibrated, so in many cases you would not look at the probabilistic output. – Tim Nov 09 '17 at 09:00
  • 9
    @Tim Frank's point (that he discussed in numerous answers on our site and elsewhere), as I understand it, is that if a classification algorithm does not return probabilities then it's garbage and should not be used. To be honest, most of the commonly used algorithms do return probabilities. – amoeba Nov 09 '17 at 09:17
  • 13
    I'd say that an algorithm that takes past observations and outputs only classifications without taking the points above into account (e.g., costs of mis-decisions) conflates the statistical and the decision aspect. It's like someone recommending a particular type of car to you without first asking you whether you want to transport a little league baseball team, a bunch of building materials, or only yourself. So I'd also say such an algorithm would be garbage. – Stephan Kolassa Nov 09 '17 at 09:23
  • 14
    I was going to write an answer, but then didn't need to. Bravo. I discuss this with my students as a "separation of concerns" between statistical modeling and decision making. This type of concept is very deeply rooted in engineering culture. – Matthew Drury Nov 09 '17 at 14:34
  • It would be more helpful if the author could also situate the response in the context of imbalanced datasets, as noted in the "first part" of the question. – chainD Nov 10 '17 at 00:28
  • 3
    Probability modeling is more or less immune to the "issues" of unbalanced data. – Matthew Drury Nov 10 '17 at 03:01
  • 1
    @MatthewDrury And why is that? In the hypothetical situation when 95% training data is class A and only 5% is class B and so stupid classifier always outputting A would have impressive 0.95 accuracy, a no less stupid probability predictor always predicting P(A)=0.95 would also have good score. What's the big difference? – amoeba Nov 10 '17 at 07:48
  • 5
    @chainD: my point is that accuracy is problematic even in the context of balanced datasets. – Stephan Kolassa Nov 10 '17 at 07:48
  • 2
    @StephanKolassa Perhaps it does make sense to stress in your answer somewhere that your answer discusses a *different* issue with accuracy compared to most of the other answers (that focus on unbalanced datasets). And if there is a connection between these two issues (as Matthew wrote above), then it's definitely worth discussing it. – amoeba Nov 10 '17 at 07:50
  • 3
    @amoeba: suppose your rare class is people suffering from a disease (which can be much rarer than 5%). The naive classifier says that everyone is healthy, end of discussion. The naive probabilistic predictor says that everyone has a small chance of having the disease. Still the population prevalence, because it's a *naive* predictor, but it will still point to the fact that more tests or a more sophisticated model might be useful. – Stephan Kolassa Nov 10 '17 at 07:51
  • @amoeba: good point there. I edited. – Stephan Kolassa Nov 10 '17 at 07:54
  • If you evolve a neural network or similar and use accuracy to decide whether a mutation is good or bad there's a tendency to converge to constant 0 or constant 1 (giving you either .6 or .4 accuracy) and if you use an error function it tends to converge to constant 0.5. In these cases maximizing for sensitivity AND specificity may be a better choice even if you don't have extreme imbalance. – mroman Nov 10 '17 at 12:41
  • 2
    @StephanKolassa Thanks for adding the clarification note to your post. A quick question regarding your response to amoeba's comment (@amoeba hope you don't mind): I am wondering why a result like "everyone has a small chance of having the disease" might "point to [...] more tests or a more sophisticated model" but a prediction like "everyone is healthy" not "point to" anything? This sounds like a call for the analyst to make, who should perhaps be as concerned (if not more) about running additional tests if the prediction is "everyone is healthy"? Why is it the "end of discussion"? – chainD Nov 10 '17 at 19:22
  • 10
    @chainD: if your classifier (remember, it's the one with the *highest accuracy*) says that "everyone in this sample is healthy", then what doctor or analyst would believe that there is more to the story? I agree that in the end, it's a call for the analyst to make, but "everyone is healthy" is far less helpful to the analyst than something that draws attention to residual uncertainty like the 95%/5% prediction. – Stephan Kolassa Nov 10 '17 at 20:43
  • "if your classifier (remember, it's the one with the highest accuracy) says that "everyone in this sample is healthy", then what doctor or analyst would believe that there is more to the story?" Every doctor or analyst worth the penny they are paid? :) But, of course, I agree with you on your other point regarding which of the two indicators is typically more helpful in such situations. – chainD Nov 10 '17 at 21:53
  • 1
    Perhaps there should be a sentence about those proper scoring rules to make it more explicit that if you want to compare the predicted probabilities to the real probabilities with such a loss function, you actually need those true probabilities. You would routinely not have them because there are only true discrete classes for records and no information about the probability of them belonging to those true classes. – David Ernst Nov 11 '17 at 06:21
  • @DavidErnst: no, scoring rules do not use true probabilities, they use *observed outcomes*, I'll add "observed" to make this a little clearer. A loss function that depends on unobservable variables would not be very useful. The idea is that we take the average over the scoring rule evaluated on multiple (best: many) observed outcomes and the corresponding predicted class membership probabilities, as an estimate of the expectation of the scoring rule. – Stephan Kolassa Nov 11 '17 at 10:10
  • 16
    @StephanKolassa 's answer and comments are superb. Someone else comment implied that there is a difference in how this is viewed depending on which culture you are part of. This is not really the case; it's just that some fields bothered to understand the literature and others didn't. Weather forecasting, for example, has been at the forefront and has used proper scoring rules for assessing forecaster accuracy since at least 1951. – Frank Harrell Nov 11 '17 at 12:53
  • @StephanKolassa I know that most scoring rules don't. The point is that there might be exceptions where there is label uncertainty expressed through true label probabilities. Now, I don't argue those need to be mentioned in this post. Quite the contrary, I argue that the generalized notation of p and 1-p, which is designed to capture those cases when they happen, does not intuitively convey that p will most of the time be zero or one. It conveys that one would need true label probabilities. So I would just suggest adding a sentence: p will in most applications be the true *outcome*. – David Ernst Nov 11 '17 at 13:41
  • 1
    Using the accuracy score as a test statistic is a bad idea. See [this paper](https://arxiv.org/abs/1608.08873) inspired by @FrankHarrell ' s response. – JohnRos Nov 12 '17 at 09:25
  • 3
    Great answer. I do agree that the decision and the modeling are different issues. However, sometimes you cannot separate them cleanly. For example, the decisions to take might impose different borderlines, where modeling near them is important and modeling far from them matters much less. – DaL Nov 12 '17 at 11:57
  • @DavidErnst: I'm afraid I don't really follow [your last comment](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/312787?noredirect=1#comment594772_312787). $(\hat{p},1-\hat{p})$ is the probabilistic classification, but we will always only observe true class membership, not $p$, so scoring rules take $\hat{p}$ and true classes as inputs. Can you clarify? – Stephan Kolassa Nov 12 '17 at 19:53
  • The description is now less confusing than before when it could easily be read as the scoring rule comparing estimated probabilities $(\hat{p},1-\hat{p})$ with reference quantities $(p,1-p)$, also probabilities. Now it says that the estimated probabilities are compared with the true classes. – David Ernst Nov 12 '17 at 20:17
  • 5
    I'd like to chime in with the importance of (strictly) proper scoring rules. And point to an undesirable behaviour of those "proportion of test cases" (accuray, sensitivity, specificity, etc.) that is implicit in this answer but IMHO deserves being stated explicitly. Accuracy & Co. take discrete values, whereas the proper scoring rules (e.g. Brier's score = MSE for classification) are continuous functions: assume a given model is changed *a little*. Accuracy & Co. will stay constant unless the change is large enough to move the class boundary to the other side of one (or more) test cases. ... – cbeleites unhappy with SX Nov 19 '17 at 15:50
  • 4
    ... then suddenly, a whole additional misclassification is counted. In contrast, a strictly proper scoring rule will react immediately to slight changes in the model. In particular, optimization strategies as used for model hyperparameters will typically assume a behaviour of the target functional that only proper or strictly proper scoring rules have (but are nevertheless employed using improper scoring rules). Also, in my experience, e.g. Brier's score is subject to much less variance than these proportion-based "hard" figures of merit. (There's no theoretical guarantee for ... – cbeleites unhappy with SX Nov 19 '17 at 15:57
  • 2
    ... this lower variance: if [and only if] the model happens to output basically only ever 0% or 100% predicted class membership, the variance will be just as bad as with the proportions) – cbeleites unhappy with SX Nov 19 '17 at 15:59
  • @StephanKolassa The first two bullet points mentioned in your answer fall into the scope of risk quantification rather than modeling. For example, it's possible to add a reliability function that captures the first two bullet points, but this won't boost the model performance. Instead, I would talk more about how most measures fail to correctly capture the performance of ensemble models. – HoofarLotusX Sep 27 '18 at 16:49
  • 3
    @amoeba, the point being made is precisely that it is SADLY not true that "most of the commonly used algorithms do return probabilities". Most machine learning starts from a 'classification problem' and doesn't produce probabilities naturally: eg SVMs, trees and random forests. [Then you have multiple subsequent papers on how do you create probabilities from eg random forests] – seanv507 Dec 23 '18 at 19:05
  • Is it correct to say that scoring rules can be used only by the models which generate predicted probabilities? For example, I don't see how to use scoring rules for SVM or random forest. – SiXUlm Aug 20 '19 at 13:15
  • @SiXUlm: yes, that is correct. [There are Random Forest implementations that yield predictive densities](https://stats.stackexchange.com/q/358948/1352), and this modification is conceptually actually quite straightforward. There is no common variant of SVM that does this, which is one of the great weaknesses of SVMs. (You might be able to do something by bootstrapping and fitting a new SVM each time.) – Stephan Kolassa Aug 20 '19 at 13:19
  • @StephanKolassa: thanks for the reference, it is interesting. For SVM, I found the so-called Platt scaling https://en.wikipedia.org/wiki/Platt_scaling. – SiXUlm Aug 20 '19 at 13:35
  • And can scoring rules take into account the costs of False Negative and False Positive, for example a FP is X times more costly than a FN? – SiXUlm Aug 20 '19 at 13:49
  • @SiXUlm: I would argue that [your question] is a [category error](https://stats.stackexchange.com/a/368979/1352). Scoring rules evaluate *predictive densities*. FP/FN/etc. evaluate a [*decision*](https://stats.stackexchange.com/a/312124/1352). (And a very specific kind of decision, to boot.) Decisions should take predictive distributions into account (which can be assessed by scoring rules), but also the costs of different subsequent decisions. And it simply makes no sense to evaluate both using a common metric. – Stephan Kolassa Aug 20 '19 at 13:56
  • @StephanKolassa: that's convincing, thanks for pointing out. It takes me some time to study your answers in the related posts and I find them very helpful. I feel that calling accuracy (with or without additional assumption) a scoring rule is inappropriate because such accuracy is a result of the decision, but no decision should be involved in the predictive densities (via scoring rules). Is it correct? – SiXUlm Aug 25 '19 at 20:26
  • @SiXUlm: yes, exactly. [I write some more along these lines here](https://stats.stackexchange.com/a/359936/1352), which you may already have seen. – Stephan Kolassa Aug 25 '19 at 20:30
  • @StephanKolassa: yes, I did read them and try to write my own understanding based on what you have written, but probably I overthought. Thanks for confirming it. – SiXUlm Aug 26 '19 at 07:31
  • Does this have any implications for methods that attempt to change an imbalanced dataset through resampling (undersampling and oversampling)? – user76284 Mar 13 '21 at 23:47
  • @user76284: over-/undersampling is an attempt to fix the problem with accuracy that is driven by the mistaken belief that the problem stems from the class imbalance, instead of from the fact that accuracy is broken. See https://stats.stackexchange.com/q/357466/1352 and https://stats.stackexchange.com/q/359909/1352. – Stephan Kolassa Mar 14 '21 at 07:01
  • 3
    Bit late to the discussion @amoeba says "if a classification algorithm does not return probabilities then it's garbage and should not be used." I think Vladimir Vapnik would disagree, and the SVM has been quite a successful classification algorithm (certainly not garbage). The argument being (see his big gray book for details) that we shouldn't waste modelling resources on features of the problem that don't affect the decision boundary. – Dikran Marsupial Jul 27 '21 at 08:41
  • 1
    I think @Benoit_Sanchez's answer makes a good point. Sometimes classification accuracy (or more generally the expected loss) ***is*** the thing we are interested in, so for some problems it is the ideal performance metric. However, that does not mean it is the ideal model selection criterion, for which proper scoring rules are likely to be better (as I found out by performing some experiments - http://theoval.cmp.uea.ac.uk/publications/pdf/ijcnn2006a.pdf ). – Dikran Marsupial Jul 27 '21 at 08:45
  • 1
    @DikranMarsupial I said that it's Frank Harrel's point (as I understand it)! Not mine. – amoeba Jul 28 '21 at 20:21
  • 1
    Sincere apologies @amoeba, I should indeed have made that clear! – Dikran Marsupial Jul 29 '21 at 06:25
101

When we use accuracy, we assign equal cost to false positives and false negatives. When the data set is imbalanced - say it has 99% of instances in one class and only 1% in the other - there is a great way to lower the cost: predict that every instance belongs to the majority class, get an accuracy of 99%, and go home early.

The problem starts when the actual costs that we assign to each type of error are not equal. If we deal with a rare but fatal disease, the cost of failing to diagnose a sick person's disease is much higher than the cost of sending a healthy person for more tests.
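As a sketch of this point (the cost numbers below are made up for illustration): once false negatives are much more expensive than false positives, the decision rule with the highest accuracy can also be the most expensive one.

    import numpy as np

    COST_FN = 100.0   # assumed cost of an undetected sick patient
    COST_FP = 1.0     # assumed cost of an unnecessary follow-up test

    def expected_cost(y_true, y_pred):
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        return (fn * COST_FN + fp * COST_FP) / y_true.size

    rng = np.random.default_rng(1)
    y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% prevalence
    always_healthy = np.zeros_like(y_true)             # the 99%-accuracy rule
    flag_everyone = np.ones_like(y_true)               # ~1% accuracy, but...

    print(np.mean(always_healthy == y_true))           # ~0.99
    print(expected_cost(y_true, always_healthy))       # ~1.00 per patient
    print(expected_cost(y_true, flag_everyone))        # ~0.99 per patient: cheaper despite the terrible accuracy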

In general, there is no universal best measure. The best measure is derived from your needs. In a sense, it is not a machine learning question, but a business question. It is common for two people to use the same data set but choose different metrics because of different goals.

Accuracy is a great metric. Actually, most metrics are great and I like to evaluate many metrics. However, at some point you will need to decide between using model A or B. There you should use a single metric that best fits your need.

For extra credit, choose this metric before the analysis, so you won't be distracted when making the decision.

DaL
  • 4
    Great answer - I've proposed a couple of edits just to try and make the point clearer to beginners in machine learning (at whom this question is aimed). – nekomatic Nov 09 '17 at 08:47
  • 1
    I'd disagree that it's not a machine learning problem. But addressing it would involve doing machine learning on the meta problem and necessitate the machine having access to some kind of data beyond just the basic classification information. – Shufflepants Nov 09 '17 at 20:41
  • 3
    I don't see it as a function of only the data, since different goals can lead to different cost/model/performance/metrics. I do agree that in general the question of cost can be handled mathematically. However, questions like the cost of treating patients rely on totally different information. The information needed for this meta problem is usually not suitable for machine learning methodology, so most of the time it is handled with different methods. – DaL Nov 12 '17 at 12:01
  • 2
    By "misdiagnosing a person with the disease", you mean "misdiagnosing a person *who has* the disease (as not having the disease)", right? Because that phrase could be interpreted either way. – Tanner Swett Nov 13 '17 at 01:39
  • You are right, Tanner. I changed the text to make it clearer. – DaL Nov 13 '17 at 06:52
  • 1
    Accuracy could be adjusted for prevalence (*a-priori* class membership probability) - regardless of whether the data at hand is balanced or imbalanced, and whether the imbalanced data reflects prevalence for a given use-case/scenario or not. Just as predictive values should not be calculated on "raw" numbers of test cases *unless* the frequencies of the different classes properly reflect class prevalences for the application at hand. – cbeleites unhappy with SX Nov 19 '17 at 15:43
  • Can you explain more about the adjustment? – DaL Nov 20 '17 at 06:38
  • Great answer but why not mention precision and recall as useful metrics in imbalanced cases? – Oliver Angelil Dec 20 '17 at 11:39
  • 1
    I agree that they are useful. Please note that it is very easy to trade them off (e.g., by the confidence threshold), so you indeed use them together or use a combined measure (e.g., Jaccard, F scores). – DaL Dec 20 '17 at 15:10
28

The problem with accuracy

Standard accuracy is defined as the ratio of correct classifications to the number of classifications done.

\begin{align*} accuracy := \frac{\text{correct classifications}}{\text{number of classifications}} \end{align*}

It is thus an overall measure over all classes, and as we'll shortly see, it's not a good measure to tell an oracle apart from an actually useful test. An oracle is a classification function that returns a random guess for each sample. Likewise, we want to be able to rate the classification performance of our classification function. Accuracy can be a useful measure if we have the same number of samples per class, but if we have an imbalanced set of samples, accuracy isn't useful at all. Even more so, a test can have a high accuracy but actually perform worse than a test with a lower accuracy.

If we have a distribution of samples such that $90\%$ of samples belong to class $\mathcal{A}$, $5\%$ belonging to $\mathcal{B}$ and another $5\%$ belonging to $\mathcal{C}$ then the following classification function will have an accuracy of $0.9$:

\begin{align*} classify(sample) := \begin{cases} \mathcal{A} & \text{if }\top \\ \end{cases} \end{align*}

Yet, it is obvious, given that we know how $classify$ works, that it cannot tell the classes apart at all. Likewise, we can construct a classification function

\begin{align*} classify(sample) := \text{guess} \begin{cases} \mathcal{A} & \text{with p } = 0.96 \\ \mathcal{B} & \text{with p } = 0.02 \\ \mathcal{C} & \text{with p } = 0.02 \\ \end{cases} \end{align*}

which has an accuracy of $0.96 \cdot 0.9 + 0.02 \cdot 0.05 \cdot 2 = 0.866$ and will not always predict $\mathcal{A}$, but still, given that we know how $classify$ works, it is obvious that it cannot tell the classes apart. Accuracy in this case only tells us how good our classification function is at guessing. This means that accuracy is not a good measure to tell an oracle apart from a useful test.
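A quick simulation (a sketch that simply reuses the class proportions and guessing probabilities from the example above) reproduces the $0.866$ figure, even though the guess is drawn independently of the input and therefore carries no information about the true class:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    labels = np.array(["A", "B", "C"])

    true_class = rng.choice(labels, size=n, p=[0.90, 0.05, 0.05])   # class distribution of the samples
    guess = rng.choice(labels, size=n, p=[0.96, 0.02, 0.02])        # classify() guesses, ignoring the sample

    print(np.mean(true_class == guess))   # ~0.866 = 0.96*0.9 + 0.02*0.05 + 0.02*0.05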

Accuracy per Class

We can compute the accuracy individually per class by giving our classification function only samples from the same class, counting the number of correct and incorrect classifications, and then computing $accuracy := \text{correct}/(\text{correct} + \text{incorrect})$. We repeat this for every class. If we have a classification function that can accurately recognize class $\mathcal{A}$ but will output a random guess for the other classes, then this results in an accuracy of $1.00$ for $\mathcal{A}$ and an accuracy of $0.33$ for the other classes. This already provides us with a much better way to judge the performance of our classification function. An oracle always guessing the same class will produce a per-class accuracy of $1.00$ for that class, but $0.00$ for the other classes. If our test is useful, all the per-class accuracies should be $>0.5$. Otherwise, our test isn't better than chance. However, accuracy per class does not take into account false positives. Even though our classification function has a $100\%$ accuracy for class $\mathcal{A}$, there will also be false positives for $\mathcal{A}$ (such as a $\mathcal{B}$ wrongly classified as an $\mathcal{A}$).

Sensitivity and Specificity

In medical tests sensitivity is defined as the ratio between people correctly identified as having the disease and the amount of people actually having the disease. Specificity is defined as the ratio between people correctly identified as healthy and the amount of people that are actually healthy. The amount of people actually having the disease is the amount of true positive test results plus the amount of false negative test results. The amount of actually healthy people is the amount of true negative test results plus the amount of false positive test results.

Binary Classification

In binary classification problems there are two classes, $\mathcal{P}$ and $\mathcal{N}$. $T_{n}$ refers to the number of samples that were correctly identified as belonging to class $n$ and $F_{n}$ refers to the number of samples that were falsely identified as belonging to class $n$. In this case, sensitivity and specificity are defined as follows:

\begin{align*} sensitivity := \frac{T_{\mathcal{P}}}{T_{\mathcal{P}}+F_{\mathcal{N}}} \\ specificity := \frac{T_{\mathcal{N}}}{T_{\mathcal{N}}+F_{\mathcal{P}}} \end{align*}

$T_{\mathcal{P}}$ being the true positives $F_{\mathcal{N}}$ being the false negatives, $T_{\mathcal{N}}$ being the true negatives and $F_{\mathcal{P}}$ being the false positives. However, thinking in terms of negatives and positives is fine for medical tests but in order to get a better intuition we should not think in terms of negatives and positives but in generic classes $\alpha$ and $\beta$. Then, we can say that the amount of samples correctly identified as belonging to $\alpha$ is $T_{\alpha}$ and the amount of samples that actually belong to $\alpha$ is $T_{\alpha} + F_{\beta}$. The amount of samples correctly identified as not belonging to $\alpha$ is $T_{\beta}$ and the amount of samples actually not belonging to $\alpha$ is $T_{\beta} + F_{\alpha}$. This gives us the sensitivity and specificity for $\alpha$ but we can also apply the same thing to the class $\beta$. The amount of samples correctly identified as belonging to $\beta$ is $T_{\beta}$ and the amount of samples actually belonging to $\beta$ is $T_{\beta} + F_{\alpha}$. The amount of samples correctly identified as not belonging to $\beta$ is $T_{\alpha}$ and the amount of samples actually not belonging to $\beta$ is $T_{\alpha} + F_{\beta}$. We thus get a sensitivity and specificity per class:

\begin{align*} sensitivity_{\alpha} := \frac{T_{\alpha}}{T_{\alpha}+F_{\beta}} \\ specificity_{\alpha} := \frac{T_{\beta}}{T_{\beta} + F_{\alpha}} \\ sensitivity_{\beta} := \frac{T_{\beta}}{T_{\beta}+F_{\alpha}} \\ specificity_{\beta} := \frac{T_{\alpha}}{T_{\alpha} + F_{\beta}} \\ \end{align*}

We however observe that $sensitivity_{\alpha} = specificity_{\beta}$ and $specificity_{\alpha} = sensitivity_{\beta}$. This means that if we only have two classes we don't need sensitivity and specificity per class.

N-Ary Classification

Sensitivity and specificity per class isn't useful if we only have two classes, but we can extend it to multiple classes. Sensitivity and specificity is defined as:

\begin{align*} \text{sensitivity} := \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \\ \text{specificity} := \frac{\text{true negatives}}{\text{true negatives} + \text{false-positives}} \\ \end{align*}

The true positives are simply $T_{n}$, the false negatives are simply $\sum_{i}(F_{n,i})$ and the false positives are simply $\sum_{i}(F_{i,n})$. Finding the true negatives is much harder, but we can say that if we correctly classify something as belonging to a class different from $n$, it counts as a true negative. This means we have at least $\sum_{i}(T_{i}) - T_{n}$ true negatives. However, these aren't all the true negatives. All the wrong classifications for a class different from $n$ are also true negatives, because they correctly weren't identified as belonging to $n$. $\sum_{i}(\sum_{k}(F_{i,k}))$ represents all wrong classifications. From this we have to subtract the cases where the input class was $n$, meaning we have to subtract the false negatives for $n$, which is $\sum_{i}(F_{n,i})$, but we also have to subtract the false positives for $n$ because they are false positives and not true negatives, so we also have to subtract $\sum_{i}(F_{i,n})$, finally getting $\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})$. As a summary we have:

\begin{align*} \text{true positives} := T_{n} \\ \text{true negatives} := \sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n}) \\ \text{false positives} := \sum_{i}(F_{i,n}) \\ \text{false negatives} := \sum_{i}(F_{n,i}) \end{align*}

\begin{align*} sensitivity(n) := \frac{T_{n}}{T_{n} + \sum_{i}(F_{n,i})} \\ specificity(n) := \frac{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})}{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i})} \end{align*}
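As an illustration of these formulas, here is a short Python sketch with a hypothetical confusion matrix $C$, where rows are true classes and columns are predicted classes, so that $C_{n,n} = T_{n}$ and $C_{n,i} = F_{n,i}$ for $i \neq n$:

    import numpy as np

    C = np.array([[50,  3,  2],    # hypothetical confusion matrix:
                  [ 4, 30,  6],    # rows = true class, columns = predicted class
                  [ 1,  5, 40]])

    total = C.sum()
    for n in range(C.shape[0]):
        tp = C[n, n]                    # T_n
        fn = C[n, :].sum() - tp         # sum_i F_{n,i}
        fp = C[:, n].sum() - tp         # sum_i F_{i,n}
        tn = total - tp - fn - fp       # everything involving neither true class n nor predicted class n
        print(f"class {n}: sensitivity = {tp / (tp + fn):.3f}, specificity = {tn / (tn + fp):.3f}")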

Introducing Confidence

We define a $confidence^{\top}$ which is a measure of how confident we can be that the reply of our classification function is actually correct. $T_{n} + \sum_{i}(F_{i,n})$ are all cases where the classification function replied with $n$ but only $T_{n}$ of those are correct. We thus define

\begin{align*} confidence^{\top}(n) := \frac{T_{n}}{T_{n}+\sum_{i}(F_{i,n})} \end{align*}

But can we also define a $confidence^{\bot}$ which is a measure of how confident we can be that if our classification function responds with a class different than $n$ that it actually wasn't an $n$?

Well, we get $\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}$ such cases, all of which are correct except for $\sum_{i}(F_{n,i})$. Thus, we define

\begin{align*} confidence^{\bot}(n) = \frac{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}-\sum_{i}(F_{n,i})}{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}} \end{align*}
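Continuing the same hypothetical confusion-matrix convention as in the sketch above, both confidence measures can be read off directly:

    import numpy as np

    C = np.array([[50,  3,  2],    # same hypothetical matrix: rows = true class, columns = predicted class
                  [ 4, 30,  6],
                  [ 1,  5, 40]])

    for n in range(C.shape[0]):
        predicted_n = C[:, n].sum()               # T_n + sum_i F_{i,n}: all "replied n" cases
        predicted_not_n = C.sum() - predicted_n   # all "replied something other than n" cases
        missed_n = C[n, :].sum() - C[n, n]        # sum_i F_{n,i}: the true n cases hiding among them
        conf_top = C[n, n] / predicted_n
        conf_bot = (predicted_not_n - missed_n) / predicted_not_n
        print(f"class {n}: confidence_top = {conf_top:.3f}, confidence_bottom = {conf_bot:.3f}")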

user2740
mroman
  • Can you please provide an example of calculating mean accuracy using a confusion matrix? – Aadnan Farooq A Sep 27 '18 at 01:17
  • You can find a more detailed description with examples here: https://mroman.ch/guides/sensspec.html – mroman Sep 27 '18 at 14:40
  • Reading through it again there's an error in the definition of confidence_false. I'm surprised nobody spotted that. I'll fix that in the next few days. – mroman Sep 27 '18 at 16:14
15

Imbalanced classes in your dataset

To be short: imagine that 99% of your data set is one class (say apples) and 1% is another class (say bananas). My super duper algorithm gets an astonishing 99% accuracy on this data set - check it out:

return "it's an apple"

It will be right 99% of the time and therefore gets 99% accuracy. Can I sell you my algorithm?

Solution: don't use an absolute measure (accuracy) but a relative-to-each-class measure (there are a lot out there, like ROC AUC)
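A small sketch of that (hypothetical 99%/1% data; here per-class recall and balanced accuracy stand in as simple relative-to-each-class measures):

    import numpy as np
    from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

    rng = np.random.default_rng(7)
    y = (rng.random(10_000) < 0.01).astype(int)   # 1 = banana (1%), 0 = apple (99%)
    always_apple = np.zeros_like(y)               # the "super duper algorithm" above

    print(accuracy_score(y, always_apple))                 # ~0.99: looks impressive
    print(recall_score(y, always_apple, average=None))     # [1.0, 0.0]: every banana is missed
    print(balanced_accuracy_score(y, always_apple))        # 0.5: no better than chance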

NelsonGon
Mayou36
  • Nope, AUC is also not appropriate for imbalanced dataset. – SiXUlm Aug 16 '19 at 21:39
  • @SiXUlm, can you elaborate on that? – Mayou36 Aug 17 '19 at 00:01
  • AUC is the area under ROC curve. The ROC curve is the plot of TPR vs FPR. Now, in the Bayesian setting, the imbalance is the odd of prior probability: $P(D)/P(D^C)$. The TPR can be seen as $P(T \vert D)$ and FPR can be seen as $P(F \vert D^C)$. The prior probability has nothing to do with the likelihood. – SiXUlm Aug 17 '19 at 23:38
  • A clearer illustration can be found here: https://www.quora.com/Why-is-AUC-Area-under-ROC-insensitive-to-class-distribution-changes. Have a look at Jerry Ma's answer. – SiXUlm Aug 17 '19 at 23:42
  • I still do not understand your point. Isn't that (including the Quora answer) what I am saying in the solution, and exactly supporting my answer? The point is that the priors should not affect the metric that measures the performance of the network. What _is_ appropriate depends entirely on your problem, e.g. the best is to optimize for _every possible cut_. So let me know: a) since it is invariant to the priors but sensitive to the performance, _why_ is _that_ inappropriate? b) what else would you think _is_ appropriate, or which characteristics are required? – Mayou36 Aug 19 '19 at 17:02
  • 1
    Hmm, sorry about that, I think you are correct. Actually I made a logical error and misinterpreted the effect of insensitivity to the performance. – SiXUlm Aug 19 '19 at 20:10
  • I see, no problem! Thanks for questioning – Mayou36 Aug 20 '19 at 18:58
  • 7
    AUC is actually appropriate for an imbalanced dataset; in fact it is one of the better metrics. In the extreme case, it is relatively insensitive to false positives in the positive fraction compared with precision-recall curves, but saying it is inappropriate for class imbalance is just incorrect. – Christopher John Oct 03 '19 at 12:51
  • Or alternatively you could use accuracy gain, which is the relative improvement in accuracy over just guessing the majority class all the time. It still has some of the problems of accuracy as a metric, such as being brittle, but not this one. – Dikran Marsupial Aug 05 '21 at 07:45
  • @DikranMarsupial, while it may removes this bias, it still has the flaw I think that you can't compare the numbers. The ROC AUC can be compared amongst different imbalanced datasets. – Mayou36 Aug 05 '21 at 10:07
  • Yes, you can compare the numbers just as you could AUC or any other metric. As I said auccuracy still has problems, but incommensurability across datasets isn't a problem either (at least no more so than it is for AUC). – Dikran Marsupial Aug 05 '21 at 11:08
  • please don't use auROC for imbalanced datasets. It will favor the larger class and a high FP rate will go unnoticed. Use auPRC instead if you have an imbalanced dataset. Balanced accuracy can also be used; that is the mean of the accuracies of each of your classes. – David Feb 23 '22 at 05:55
  • The FPR won't just go up. If you set the output threshold at 50%, it will - but that is expected from a probabilistic point of view if the model does not do any normalization itself. Don't set the cut at 0.5 (as you in general should never do anyway!). The auROC is immune to biases in imbalanced datasets, as it does not depend on the absolute output value of the classifier. – Mayou36 Feb 23 '22 at 21:22
12

Here is a somewhat adversarial counter-example, where accuracy is better than a proper scoring rule, based on @Benoit_Sanchez's neat thought experiment:

You own an egg shop and each egg you sell generates a net revenue of 2 dollars. Each customer who enters the shop may either buy an egg or leave without buying any. For some customers you can decide to make a discount and you will only get 1 dollar revenue but then the customer will always buy.

You plug a webcam that analyses the customer behaviour with features such as "sniffs the eggs", "holds a book with omelette recipes"... and classify them into "wants to buy at 2 dollars" (positive) and "wants to buy only at 1 dollar" (negative) before he leaves.

If your classifier makes no mistake, then you get the maximum revenue you can expect. If it's not perfect, then:

  • for every false positive you lose 1 dollar because the customer leaves and you didn't try to make a successful discount

  • for every false negative you lose 1 dollar because you make a useless discount

Then the accuracy of your classifier is exactly how close you are to the maximum revenue. It is the perfect measure.

So say we record the amount of time the customer spends "sniffing eggs" and "holding a book with omelette recipes" and make ourselves a classification task:

[Figure: the synthetic two-class data set for this task]

This is actually my version of Brian Ripley's synthetic benchmark dataset, but let's pretend it is the data for our task. As this is a synthetic task, I can work out the probabilities of class membership according to the true data generating process:

[Figure: the true posterior probabilities of class membership]

Unfortunately it is upside-down because I couldn't work out how to fix it in MATLAB, but please bear with me. Now in practice, we won't get a perfect model, so here is a model with an error (I have just perturbed the true posterior probabilities with a Gaussian bump)

[Figure: the first model's predicted probabilities, perturbed by a Gaussian bump]

And here is another one, with a bump in a different place.

[Figure: the second model's predicted probabilities, with the bump in a different place]

Now the Brier score is a proper scoring rule, and it gives a slightly lower (better) score for the second model (because the perturbation is in a region of slightly lower density). However, the perturbation in the first model is well away from the decision boundary, and so that one has a higher accuracy.

Since in this particular application, the accuracy is equal to our financial gain in dollars, the Brier score is selecting the wrong model, and we will lose money.
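The same ranking reversal can be shown with a deliberately tiny numeric sketch (my own made-up probabilities, not the Ripley data above): model A puts every case just barely on the correct side of 0.5 and wins on accuracy, while model B is much better calibrated and wins on the Brier score.

    import numpy as np

    y = np.array([1, 0, 1, 0])                   # hypothetical true classes
    p_a = np.array([0.55, 0.45, 0.55, 0.45])     # model A: always just on the right side of 0.5
    p_b = np.array([0.95, 0.05, 0.95, 0.55])     # model B: sharp and well calibrated, except for one case

    def accuracy(p, y): return np.mean((p >= 0.5) == y)
    def brier(p, y):    return np.mean((p - y) ** 2)

    print(accuracy(p_a, y), accuracy(p_b, y))    # 1.00 vs 0.75: accuracy prefers A
    print(brier(p_a, y), brier(p_b, y))          # 0.2025 vs 0.0775: the Brier score prefers B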

Vapnik's advice that it is often better to form a purely discriminative classifier directly (rather than estimate a probability and threshold it) is based on this sort of situation. If all we are interested in is making a binary decision, then we don't really care what the classifier does away from the decision boundary, so we shouldn't waste resources modelling features of the data distribution that don't affect the decision.

This is a laconic "if", though. If it is a classification task with fixed misclassification costs, no covariate shift, and known and constant operational class priors, then this approach may indeed be better (and the success of the SVM in many practical applications is some evidence of that). However, many applications are not like that: we may not know ahead of time what the misclassification costs are, or equivalently the operational class frequencies. In those applications we are much better off with a probabilistic classifier, setting the thresholds appropriately according to the operational conditions.

Whether accuracy is a good performance metric depends on the needs of the application; there is no "one size fits all" policy. We need to understand the tools we use, be aware of their advantages and pitfalls, and consider the purpose of the exercise when choosing the right tool from the toolbox. In this example, the problem with the Brier score is that it ignores the true needs of the application, and no amount of adjusting the threshold will compensate for its selection of the wrong model.

It is also important to make a distinction between performance evaluation and model selection - they are not the same thing, and sometimes (often?) it is better to have a proper scoring rule for model selection in order to achieve maximum performance according to your metric of real interest (e.g. accuracy).

Dikran Marsupial
  • This answer (+1) usefully shows why accuracy can work when false-negative and false positive costs are equal, for the cost-optimal probability cutoff at 0.5 that accuracy implicitly uses. Much discussion about superiority of proper scoring rules elides the issue of downstream uses and specific cost tradeoffs. If you need to make a choice based on a model, the model needs to work well around the probability cutoff that matches your cost tradeoffs. Different proper scoring rules emphasize different probability regions. Some links are [here](https://stats.stackexchange.com/a/470387/28500). – EdM Jul 30 '21 at 16:14
  • 1
    Thanks for the info in the link. Similar examples could be constructed for cases where the misclassification costs were not equal (in which case it would be weighted accuracy - balanced error rate being another special case). I certainly agree with @StephanKolassa that "accuracy is problematic even in the context of balanced datasets", but proper scoring rules are not a panacea. – Dikran Marsupial Jul 30 '21 at 16:23
  • Why do you say this shows that accuracy picks the right model rather than accuracy sometimes picks the right model? There should be some sense of expected value, not just in the occasional instance, should there not? – Dave Aug 11 '21 at 18:45
  • 1
    "Why do you say this shows that accuracy picks the right model rather than accuracy sometimes picks the right model? " I wrote "Here is a somewhat adversarial counter-example, where accuracy is better than a proper scoring rule," which is clearly stating that this is one example where this is the case and is in no way stating a general rule. I even highlighted that it was an *adversarial* example. – Dikran Marsupial Aug 11 '21 at 19:07
  • But this just shows that sometimes the Brier score picks the wrong model, not any kind of expected loss. Proper scoring rules do not even have to agree on the preferred model. – Dave Aug 12 '21 at 11:18
  • 1
    @Dave I have explained why this is important in the body of the answer. If you have an application where accuracy *IS* your primary concern, then deferring the classification can make performance worse by focusing modelling on aspects of the data that are unimportant. This is why SVMs have been so successful for that particular type of application. It is the insistence of a one-size-fits all approach that I think is wrong. It isn't accuracy **or** proper scoring rules - both have their uses. – Dikran Marsupial Aug 12 '21 at 11:23
  • 1
    The important thing is understanding *WHY* in this case the Brier score picks the wrong model. Understand that, and it is easier to see why the SVM (and other purely discriminative classifiers) perform better in some applications (but worse in others) and shouldn't be discarded a-priori. – Dikran Marsupial Aug 12 '21 at 11:25
  • BTW if you misrepresent what someone has said, then acknowledging that you had done so (and preferably apologising) helps to keep online discussion cordial. Expectations are irrelevant here. We fit models to individual samples of data, so we need to be aware of the problems that can crop up in individual cases, as well as on average. If I were arguing that you should never use proper scoring rules, THEN it would matter whether accuracy was better on average, but I have made it abundantly clear I am making no such argument. – Dikran Marsupial Aug 12 '21 at 11:35
  • 1
    A great answer! Related: ["When is it appropriate to use an improper scoring rule?"](https://stats.stackexchange.com/questions/208529/). @Dave, perhaps interesting for you, too. – Richard Hardy Jan 10 '22 at 19:56
  • 1
    @RichardHardy the example in the accepted answer to that question is a very nice example of getting distracted by features of the data distribution that are irrelevant to the end goal of classification. Cheers! – Dikran Marsupial Jan 10 '22 at 19:57
3

DaL's answer is exactly this. I'll illustrate it with a very simple example about... selling eggs.

You own an egg shop and each egg you sell generates a net revenue of $2$ dollars. Each customer who enters the shop may either buy an egg or leave without buying any. For some customers you can decide to make a discount and you will only get $1$ dollar revenue but then the customer will always buy.

You plug a webcam that analyses the customer behaviour with features such as "sniffs the eggs", "holds a book with omelette recipes"... and classify them into "wants to buy at $2$ dollars" (positive) and "wants to buy only at $1$ dollar" (negative) before he leaves.

If your classifier makes no mistake, then you get the maximum revenue you can expect. If it's not perfect, then:

  • for every false positive you lose $1$ dollar because the customer leaves and you didn't try to make a successful discount
  • for every false negative you lose $1$ dollar because you make a useless discount

Then the accuracy of your classifier is exactly how close you are to the maximum revenue. It is the perfect measure.

But now suppose the discounted price is $a$ dollars. The costs are:

  • false positive: $a$
  • false negative: $2-a$

Then you need an accuracy weighted with these numbers as a measure of the efficiency of the classifier. If $a=0.001$ for example, the measure is totally different. This situation is likely related to imbalanced data: few customers are ready to pay $2$, while most would only buy at the discounted price of $0.001$. You don't care about getting many false positives in order to get a few more true positives. You can adjust the threshold of the classifier accordingly.
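As a sketch of that threshold adjustment (with hypothetical, calibrated predicted probabilities): offering the discounted price $a$ is worth it exactly when $a > 2\hat{p}$, so the revenue-maximizing threshold is roughly $a/2$ rather than the $0.5$ that plain accuracy implicitly assumes.

    import numpy as np

    rng = np.random.default_rng(3)
    p_hat = rng.random(100_000)                               # hypothetical calibrated P(buys at 2 dollars)
    buys_at_2 = (rng.random(p_hat.size) < p_hat).astype(int)  # what each customer actually does

    def revenue(threshold, a):
        # Above the threshold: no discount, earn 2 only if the customer really buys at the full price.
        # Below the threshold: offer the discounted price a, and the customer always buys.
        no_discount = p_hat >= threshold
        return np.where(no_discount, 2.0 * buys_at_2, a).sum()

    for a in (1.0, 0.001):
        grid = np.linspace(0.0, 1.0, 101)
        best = grid[np.argmax([revenue(t, a) for t in grid])]
        print(f"a = {a}: best threshold on this sample ~ {best:.2f} (calibrated-probability optimum: a/2 = {a / 2})")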

If the classifier is about finding relevant documents in a database for example, then you can compare "how much" wasting time reading an irrelevant document is compared to finding a relevant document.

Benoit Sanchez
  • 1
    This is an excellent answer. The performance metric depends on the requirements of the application, and sometimes the accuracy (or more generally the expected loss) is what we are interested in. Of course that doesn't mean it is necessarily the right metric for model selection (e.g. optimising hyper-parameters), but that doesn't mean it shouldn't be used for performance evaluation (or that the "class imbalance problem" is not a problem). – Dikran Marsupial Jul 27 '21 at 08:32
  • This example seems to illustrate my point: it makes most sense to first make *probabilistic predictions* about the probability a given customer will purchase at a given discount, then make *decisions* on whether or not to offer a given discount to a given customer. The predictions are one input into the decisions, but there are also others (like the discount amount). Yes, in the end it may turn into a "weighted accuracy", but I'd say separating concerns is clearer. For one, a separation can easily be extended to multi-class problems. – Stephan Kolassa Jul 27 '21 at 09:06
  • @StephanKolassa No, for the problem as stated, accuracy *is* the quantity of interest. Extending the scenario to include new features, such as variable discount amounts, is avoiding the point made by the analogy. I agree that it is a generally good idea to estimate the probability and then threshold at a suitable level, but the point remains that for this particular scenario, the accuracy *is* the quantity of interest. – Dikran Marsupial Jul 30 '21 at 06:56
  • The work of Vapnik (especially the SVM) provides justification for solving the classification problem directly (avoids wasting resources, including data, on features that are irrelevant to the decision). For me, whether that is a good idea depends on the requirements of the application. It seems fine for e.g. handwritten digit recognition, where costs and priors are fixed, less so for things like medical diagnosis where operational conditions are more fluid. – Dikran Marsupial Jul 30 '21 at 06:58
  • @DikranMarsupial: accuracy is the KPI of interest here because it happens to lead us to the threshold that maximizes the expected payoff. (I'll trust you two it does, I didn't do the math.) As Benoit notes, if the cost/revenue structure changes, the optimal threshold changes, so we need to optimize a "weighted accuracy". I still believe thinking about this in terms of probabilities and optimal thresholds, separating the statistical and the business problem, is more fruitful. I guess we will not agree here. – Stephan Kolassa Jul 30 '21 at 06:59
  • I disagree, it is the quantity of interest because it is the financial loss (up to a multiplicative constant). For the threshold argument, that only matters if it is too costly to retrain the classifier, and as Vapnik points out, a purely discriminative classifier may make better use of the data. I agree with both you and Frank Harrell *and* Vladimir Vapnik, but I don't think either is right all the time. It depends on the requirements of the application. – Dikran Marsupial Jul 30 '21 at 07:03
  • I should point out, I use Gaussian Processes and Kernel Logsitic Regression in my work a lot more than I use SVMs, but both sets of tools belong in my toolbox. – Dikran Marsupial Jul 30 '21 at 07:05
  • @StephanKolassa I've added an answer with a counter-example where a proper scoring rule makes the wrong choice, which can't be fixed by varying the threshold, and hopefully illustrates the Vapnik perspective. – Dikran Marsupial Jul 30 '21 at 08:53
2

I wrote a whole blog post on the matter:
https://blog.ephorie.de/zeror-the-simplest-possible-classifier-or-why-high-accuracy-can-be-misleading

ZeroR, the simplest possible classifier, just takes the majority class as the prediction. With highly imbalanced data you will get a very high accuracy, yet if your minority class is the class of interest, this is completely useless. Please find the details and examples in the post.

Bottom line: when dealing with imbalanced data you can construct overly simple classifiers that give a high accuracy yet have no practical value whatsoever...

vonjd
  • 1
    First, please expand this rather than giving a link-only answer. Second, I don't agree with your conclusion: in many cases the data is imbalanced and it may still have a lot of practical value, e.g. as discussed in this thread: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – Tim Apr 28 '20 at 12:01
  • @Tim: I don't see any practical value in ignoring the minority class completely (especially not when this is the class of interest!). – vonjd Apr 28 '20 at 12:05
  • @Tim: Added more details and the gist of the post, hope this helps... Thank you! – vonjd Apr 28 '20 at 12:09
  • @Tim:...oh, and thank you for the link to the other question, very interesting! – vonjd Apr 28 '20 at 12:11
  • 1
    Thanks, it's clearer now. – Tim Apr 28 '20 at 13:53
1

Classification accuracy is the number of correct predictions divided by the total number of predictions.

Accuracy can be misleading. For example, in a problem with a large class imbalance, a model can predict the majority class for every observation and still achieve a high classification accuracy. So further performance measures are needed, such as the F1 score and the Brier score.
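For instance (a hypothetical 5%-positive example using scikit-learn's metrics), a model that never predicts the minority class and assigns it a flat 5% probability:

    import numpy as np
    from sklearn.metrics import accuracy_score, brier_score_loss, f1_score

    y_true = np.array([1] * 5 + [0] * 95)    # 5% positives
    y_pred = np.zeros_like(y_true)           # hard predictions: never predict the positive class
    y_prob = np.full(y_true.shape, 0.05)     # predicted probability of the positive class

    print(accuracy_score(y_true, y_pred))              # 0.95, despite missing every positive case
    print(f1_score(y_true, y_pred, zero_division=0))   # 0.0: F1 exposes the problem
    print(brier_score_loss(y_true, y_prob))            # ~0.0475: scores the probabilities instead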

jeza
1

After reading through all the answers above, here is an appeal to common sense. Optimality is a flexible term and always needs to be qualified; in other words, saying a model or algorithm is "optimal" is meaningless, especially in a scientific sense.

Whenever anyone says they are scientifically optimizing something, I recommend asking a question like: "In what sense do you define optimality?" This is because in science, unless you can measure something, you cannot optimize (maximize, minimize, etc.) it.

As an example, the OP asks the following:

"Why is accuracy not the best measure for assessing classification models?"

There is an embedded reference to optimization in the word "best" from the question above. "Best" is meaningless in science because "goodness" cannot be measured scientifically.

The scientifically correct response to this question is that the OP needed to define what "good" means. In the real world (outside of academic exercises and Kaggle competitions) there is always a cost/benefit structure to consider when using a machine to suggest or make decisions to or on behalf of/instead of people.

For classification tasks, that information can be embedded in a cost/benefit matrix with entries corresponding to those of the confusion matrix. Finally, since cost/benefit information is a function of the people who are considering using mechanistic help for their decision-making, it is subject to change with the circumstances, and therefore, there is never going to be one fixed measure of optimality which will work for all time in even one problem, let alone all problems (i.e., "models") involving classification.
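For instance, a sketch with a made-up confusion matrix and a made-up cost/benefit matrix in the same layout: the expected benefit per case weights each cell of the confusion matrix by its consequence, which accuracy alone ignores.

    import numpy as np

    confusion = np.array([[900,  50],      # hypothetical counts: rows = true class, columns = predicted class
                          [ 10,  40]])

    benefit = np.array([[  0.0,  -1.0],    # hypothetical payoffs for each (true, predicted) combination:
                        [-50.0,   5.0]])   # a false alarm costs a little, a miss costs a lot, a hit pays off

    accuracy = np.trace(confusion) / confusion.sum()
    expected_benefit = (confusion * benefit).sum() / confusion.sum()
    print(f"accuracy = {accuracy:.3f}, expected benefit per case = {expected_benefit:.2f}")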

Any measure of optimality for classification which ignores costs does so at its own risk. Even the ROC AUC fails to be cost-invariant, as shown in this figure.

brethvoice
  • 3
    One of my favorite quotes from a statistician (Youden): "It is, in fact, not a statistical matter to decide what weights should be attached to these two types of diagnostic error." First page of https://acsjournals.onlinelibrary.wiley.com/doi/abs/10.1002/1097-0142%281950%293%3A1%3C32%3A%3AAID-CNCR2820030106%3E3.0.CO%3B2-3 – brethvoice Mar 28 '20 at 19:00
  • Link to my full master's thesis (from which the above linked image was taken): https://www.researchgate.net/publication/258366133_A_risk-based_comparison_of_classification_systems/link/02e7e52b0cd8ac33a3000000/download – brethvoice Oct 14 '20 at 21:18
  • After reading the answers above and discussions of "imbalanced classes," I am reminded of N. N. Taleb's writings (e.g., "The Black Swan"). Just sayin... – brethvoice Nov 09 '20 at 14:38
-3

You may view accuracy as the $R^2$ of classification: an initially appealing metric with which to compare models, but one that falls short under detailed examination.

In both cases overfitting can be a major problem. Just as a high $R^2$ might mean that you are modelling the noise rather than the signal, a high accuracy may be a red flag that your model is tailored too rigidly to your dataset and does not have general applicability. This is especially problematic when you have highly imbalanced classification categories. The most accurate model might be a trivial one which classifies all data as one category (with the accuracy equal to the proportion of the most frequent category), but this accuracy will fall spectacularly if you need to classify a dataset with a different true distribution of categories.

As others have noted, another problem with accuracy is its implicit indifference to the price of failure - i.e., an assumption that all misclassifications are equal. In practice they are not: the cost of getting the wrong classification is highly subject-dependent, and you may prefer to minimise a particular kind of wrongness rather than maximise accuracy.

James
  • 2
    Hum. (1) I'd assume that evaluating accuracy or any other metric *out-of-sample* would be understood, so I don't really see how accuracy has more of a *specific overfitting problem*. (2) if you apply a model trained on population A to a *different* population B, then you are comparing apples to oranges, and I again don't really see how this is a *specific problem for accuracy*. – Stephan Kolassa Nov 09 '17 at 11:28
  • 1
    (1) It is nevertheless a problem for accuracy, and the question is about using accuracy as a gold standard. (2) The point of building a classifier is to use it on the oranges, not just the apples. It should be general enough to capture the essential signals in the data (such as they exist), rather than being a catechism for your training data. – James Nov 09 '17 at 11:45