
I was recently asked about the model evaluation metrics involved with a binary classifier.

This person mentioned they were only interested in the True Positives and the True Negatives.


We could define a metric by ignoring FN & FP, but I am unsure what it would demonstrate.

I cannot find any literature on metrics defined using only TP & TN.

Is there any basis for using only these values to give some meaningful information?

And if so, what useful metrics could be defined?

Edit, examples for clarity:

Accuracy could not be used since it is defined as:

$\frac{TP + TN}{TP + TN + FP + FN}$

and when removing FN & FP you would have:

$\frac{TP + TN}{TP + TN} = 1$ for all $TP + TN > 0$

This is not such a useful metric. Some examples could be:

Truthiness = $\frac{TP}{TP + TN}$

Untruthiness = $\frac{TN}{TP + TN}$

Truthiness would be something like "how completely true are my predictions?", whereas Untruthiness would be something like "how utterly false are my predictions?"

Not sure what these mean but I am just defining metrics to give examples of what I mean. Is there any literature on these kinds of metrics which only use TP and TN?
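For concreteness, here is a minimal numeric sketch (Python, with made-up counts) of what these two ratios compute; since they always sum to 1, they only describe how the correct predictions split between positives and negatives:

```python
# Hypothetical counts; FP and FN are ignored entirely, as described above.
TP, TN = 40, 60

truthiness = TP / (TP + TN)    # share of the correct predictions that are positives
untruthiness = TN / (TP + TN)  # share of the correct predictions that are negatives

print(truthiness, untruthiness)  # 0.4 0.6 -- the two always sum to 1
```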

Nathan McCoy
  • A list of common functions that can be computed from a confusion matrix can be found here: https://en.wikipedia.org/wiki/Confusion_matrix – DifferentialPleiometry Dec 19 '21 at 01:58
  • I disagree with your interpretations of these ratios. Your "truthiness" score is the conditional (frequency) probability of a classification being a true positive given that the classification was true. Likewise your "untruthiness" score is the corresponding probability of getting a true negative given that the classification was true. These speak more to the tendency toward positives or negatives when the model is doing well than it does to whether the model is doing well. – DifferentialPleiometry Dec 19 '21 at 02:09

2 Answers


If the size of your data is $n$, then accuracy is $\tfrac{TP + TN}{n}$. Notice that if you use accuracy to compare different models on the same dataset, then $n$ is constant and can be dropped from the equation, so it reduces to $TP + TN$. You can call it "unnormalized accuracy" if you wish, since the only difference is that it is a non-negative integer rather than a fraction. Whether accuracy is a useful metric at all is debatable, since this isn't always obvious.
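A small sketch of this point (hypothetical counts): on a fixed dataset, ranking models by accuracy or by the raw $TP + TN$ count gives the same ordering, because $n$ is shared:

```python
# Two hypothetical models evaluated on the same dataset of n = 1000 points.
n = 1000
models = {
    "model_a": {"TP": 400, "TN": 450},
    "model_b": {"TP": 380, "TN": 430},
}

for name, counts in models.items():
    unnormalized = counts["TP"] + counts["TN"]  # "unnormalized accuracy"
    accuracy = unnormalized / n                 # the usual accuracy
    print(name, unnormalized, accuracy)

# model_a 850 0.85
# model_b 810 0.81  -> same ranking either way, since n is constant
```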

You argue that such a metric is not useful since it depends on the sample size $n$ and cannot be compared across datasets. Notice however that the same argument applies to all the other metrics. Comparing metrics across different data is often questionable, since they depend on many characteristics of the data like sample size, scaling of the data (variance of the dependent variable), base rate (in classification), etc. We have "unitless" metrics like accuracy, MAPE, or $R^2$, but they all suffer from different problems, and the "unitlessness" is often illusory and misleading.
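As a toy illustration of the last point (numbers invented, not from the question): the same accuracy value means very different things under different base rates, e.g. a degenerate classifier that always predicts the majority class can still look good on imbalanced data:

```python
# A classifier that always predicts "negative" on data with a 5% base rate.
n = 1000
positives = 50            # 5% of the data is positive
TP, FP = 0, 0             # the model never predicts positive
FN = positives
TN = n - positives

accuracy = (TP + TN) / n  # 0.95, despite the model detecting nothing
print(accuracy)
```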

Tim
  • But a raw count would not always be useful. What would this show? It would also be dependent upon each data set and could not be compared across classifiers with different data due to the lack of normalisation, correct? – Nathan McCoy Jan 19 '18 at 09:23
  • @NathanMcCoy I didn't say that it will always be useful; obviously it will depend on sample size, so you cannot use it with different sample sizes. On the other hand, *many* of the metrics depend on your data, e.g. RMSE depends on the scale of your dependent variable, etc. – Tim Jan 19 '18 at 09:26
  • @NathanMcCoy moreover, in general comparing metrics measured on different data in many cases would be questionable since they often will depend on many properties of the data like sample size, scaling of the data, base rate (in classification) so the same amount of "error" may mean different things for different data. – Tim Jan 19 '18 at 09:35
  • Downvote. Accuracy is (TP+TN)/(P+N) = 1 which does not answer the question. And the absolute values say nothing – questionto42standswithUkraine Dec 18 '21 at 20:45
  • @questionto42 I have no idea what you are trying to say by your comment. If accuracy = 1, it would be useless as a metric because it'd be constant. I also don't say anything about absolute values. – Tim Dec 18 '21 at 22:09
  • @Tim You have missed the whole point of the question, therefore it is not astonishing if you do not understand the comment. The accuracy *is* useless in a case where you do not have any False cases, which is all what the question is about. And the side-note on the absolute values is a side-note on your "unnormalized accuracy" which is of course meaningless as well if the accuracy itself is meaningless. – questionto42standswithUkraine Dec 18 '21 at 22:40
  • This answer got upvoted twice after I downvoted it yesterday for a clear reason: it does not answer the question of a metric that uses *only* TP and TN, it only says that there are no such metrics. Seems like there are some "friends" around (voting ring). I raised a moderator flag. – questionto42standswithUkraine Dec 19 '21 at 12:25
  • @questionto42 you are throwing accusations at me based on the fact alone that you disagree with my answer. This is not a [nice behavior](https://stats.stackexchange.com/conduct). Everyone is free to post their answers and vote on questions and answers, but we expect users to behave with respect to others. – Tim Dec 19 '21 at 13:55
  • Moreover, I didn't claim that such metrics do not exist. You are misreading my answer. – Tim Dec 19 '21 at 13:56
  • Appreciate your clear reaction, I seem to have misread your claim then (then I do not know what it is about, though). Nice behavior is when you put time into it and try to get further. Trying to answer a question and find out if there is some general mistake in a Q/A is also nice behavior. As you have obviously put some time into this as well and since you react, I am fine to accept it as nice behavior. I still do not see how your answer should help the OP when False test results shall be ignored or excluded according to the question. – questionto42standswithUkraine Dec 19 '21 at 16:06
  • @questionto42 the answer got six +1 and got accepted by OP so apparently it answers the question and people find it useful. – Tim Dec 19 '21 at 18:35

Recognition rate

Eurostat uses recognition rate for your idea of truthiness:

The number of positive decisions on applications for international protection as a proportion of the total number of decisions

The same definition is used in a ResearchGate forum:

recognition rate= (no. of correctly identified images / Total no. of images)*100

One could expand it to the following (a small numeric sketch follows the list):

  • positive recognition rate: TP/(TP+TN), which is TP/all when FP and FN are dropped
  • negative recognition rate: TN/(TN+TP), which is TN/all when FP and FN are dropped
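A minimal sketch of these two rates (made-up counts, assuming FP and FN are absent so that TP + TN covers the whole dataset):

```python
# Hypothetical counts with no false results, so TP + TN = all predictions.
TP, TN = 70, 30
all_preds = TP + TN

positive_recognition_rate = TP / all_preds  # 0.7
negative_recognition_rate = TN / all_preds  # 0.3
print(positive_recognition_rate, negative_recognition_rate)
```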

Predicted positive condition rate (PPCR)

Apart from the recognition rate, there exists a measure that might also fit. It is the

$\text{Predicted positive condition rate}=\frac{TP+FP}{TP+FP+TN+FN}$

... which identifies the percentage of the total population that is flagged. For example, for a search engine that returns 30 results (retrieved documents) out of 1,000,000 documents, the PPCR is 0.003%.

See: Precision and recall and go to the "Imbalanced data" section at the bottom middle

If you take out the FP and FN (for example, because your database simply does not include missing rows in its label calculations), you get a sort of true-only version of the PPCR, a PPCR filtered for true test results:

  • $\text{Predicted positive condition rate (true only)} = \frac{TP}{TP+TN}$

And finally, just inventing a name now: dropping the false test results might mean that the condition is always true, which would then allow the name:

  • "Predicted positive true rate" (PPTR)

This is what you wanted to create. But my invented name might not fit; I might misunderstand the word "condition" here. And since the examples of the recognition rate clearly drop the false test results, it is probably better to use TP/all as the "recognition rate".
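A quick sketch contrasting the full PPCR with the "true only" variant (counts invented for illustration):

```python
# Invented counts for a full confusion matrix.
TP, FP, TN, FN = 25, 5, 960, 10

ppcr = (TP + FP) / (TP + FP + TN + FN)  # share of the population flagged positive
ppcr_true_only = TP / (TP + TN)         # same idea after dropping FP and FN

print(round(ppcr, 4), round(ppcr_true_only, 4))  # 0.03 0.0254
```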

Example


++ Update and warning ++

This example is probably wrong. It turned out that the TN in the example were FP in reality, and the TN and the FN were instead the observations that were excluded from the dataset. With the example at hand, I could only calculate precision in the end (no recall, since that needs FN, and no "recognition rate", since that needs TN). Therefore, the example given is probably flawed. I leave it just as an idea.


I have a real case of a dataset where FP and FN cannot exist by design. Of course, a label is always treated as true, since the metrics are not about questioning the labels but the predictions; only the predicted class can be true or false. And in my case, the label is already an evaluation of an observation, being "yes" if the prediction is correct and "no" if not.

A prediction is True or False (T or F), and the Positives are those cases where the prediction predicts the class of interest, while the Negatives stand for those where that class is not predicted.

But what would be a False Negative or False Positive here? A FN would be a label saying that the prediction is wrong while the prediction is actually right, which makes no sense since the label is right by definition. A FP would be a label saying that the prediction is correct while the prediction is actually wrong, which makes no sense either.

There are only TP (hit) or TN (correct rejection) by design in that dataset. The same link shows FN as "type II error, miss, underestimation" and FP as "type I error, false alarm, overestimation", both of which simply cannot appear, since the prediction evaluation that we need to compare to the label column is the same as the label: a wrong prediction gets a "no" label, and a correct prediction gets a "yes" label. There is no wrong prediction that is correct (label "yes") or correct prediction that is wrong (label "no").

In that case I do not see any chance of using the common metrics like recall, precision, or even accuracy, which all need false test results, and your idea of "truthiness" is relevant.

  • The question asks about classification models. If a model cannot make mistakes, you don't need any metrics to judge the performance. – Tim Dec 19 '21 at 15:10
  • @Tim Thank you for your comment. The answer here is just one example. You could generate a similar case with normal labels as well. Just prepare a dataset where you drop the false test results before starting the model. This could happen in a 2-class prediction where you calculate the second class only if you are able to predict the first class at all, else the row is dropped (missing by design) so that you miss both Positives and Negatives as soon as you do not know their class prediction well enough. Or you even keep such missed rows but ignore them for the metric in question. – questionto42standswithUkraine Dec 19 '21 at 16:00