
I am using an evaluation metric that rewards the true positives and penalizes the false positives retrieved by a function $f(\cdot)$. It can be written as $\frac{\texttt{|TP|} - \texttt{|FP|}}{|\texttt{instances}|}$, where $\texttt{|TP|}$ and $\texttt{|FP|}$ denote the number of true positives and false positives, respectively.

The goal is simple: select a function that maximizes $\texttt{TP}$ while minimizing $\texttt{FP}$.
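For concreteness, here is a minimal sketch of the metric as I read it from the formula above (my own illustration; `tp_fp_score` is a hypothetical name, not a standard library function):

```python
def tp_fp_score(y_true, y_pred, positive=1):
    """(|TP| - |FP|) / |instances|, as defined in the question."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return (tp - fp) / len(y_true)

# One true positive and one false positive cancel out:
print(tp_fp_score([1, 1, 0, 0], [1, 0, 1, 0]))  # (1 - 1) / 4 = 0.0
```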

I need to find a standard name for this formula so I can motivate its advantages in my work. I am familiar with "sensitivity", "specificity", "F-measure", "recall", and "precision", but none of them computes what I'm evaluating here.

mhn_namak

1 Answer


I do not think this metric has an "official" name. For instance, it does not appear on the very comprehensive Wikipedia page on sensitivity and specificity, which also discusses many other related metrics.


I am a bit doubtful whether your proposed measure is really very useful. Suppose you have a completely random distribution of instances, with >50% positives and <50% negatives. Then you can maximize your criterion by having $f$ classify everything as "positive" - regardless of whether there are 51% true positives, or 60%, or 99.99%. Similarly, if there are <50% positives and >50% negatives, then you will maximize your criterion by classifying everything as "negative", again regardless of the actual prevalences. This incentive structure does not look very helpful to me. (And of course, the same argument holds if these prevalences are conditional on predictors.)
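To make the degeneracy concrete, here is a small sketch (my own illustration, not from the thread): for unpredictable labels with prevalence $p$ of positives, a classifier that labels a random fraction $q$ of instances "positive" has expected score $q\,p - q\,(1-p) = q(2p-1)$, which for any $p > 0.5$ is maximized at $q = 1$, i.e. by labeling everything positive.

```python
def expected_score(p, q):
    # Expected (|TP| - |FP|) / N when a fraction q of instances is
    # labeled "positive" at random and the true prevalence is p:
    # among the predicted positives, a fraction p are TPs, (1-p) FPs.
    return q * p - q * (1 - p)

for p in (0.51, 0.60, 0.9999):
    scores = {q: expected_score(p, q) for q in (0.0, 0.5, 1.0)}
    best_q = max(scores, key=scores.get)
    print(p, best_q)  # best_q is 1.0 for every prevalence p > 0.5
```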

This is very much related to the problems of straight-up accuracy as an evaluation measure, where the exact same problem comes up. I would recommend that you take a look at Why is accuracy not the best measure for assessing classification models? and think about how this applies to your measure.

Stephan Kolassa
  • Thanks :). Following your example, assume 100 instances of people to be classified as male/female, with 60 men and 40 women. If f returns all 100 people as men, the objective value is (60-40)/100 = 20%, but if f returns only the men as men, it is 60%. If f returns nothing, the value is 0, which is not the maximum. Thus, the objective is maximized by returning as many actual men and as few women as possible. The goal is to find all and only the positive ones. The reason I don't want to use the F-measure is that this function gives me some properties to optimize (it can be decomposed into positive/negative components easily). – mhn_namak Feb 04 '19 at 23:01
  • Yes, of course if you have a model that predicts classes perfectly, then you are good. But typically there is no perfect model, i.e., we have a predictive distribution of 60/40 *conditional on all predictors*. Your proposed measure would in this situation reward predicting "man" for *all* cases. The same for 51/49, and 99.99/0.01. I maintain that this makes no sense. ... – Stephan Kolassa Feb 05 '19 at 07:06
  • ... And also, that your measure cannot differentiate between these three cases. So it won't even tell us that there is something here that we may need to look more deeply into. For this, you need proper [tag:scoring-rule]s or similar measures that work on *predictive distributions*. – Stephan Kolassa Feb 05 '19 at 07:07
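The arithmetic in the first comment can be checked directly against the question's formula (a sketch of my own; `score` is a hypothetical helper, not from any library):

```python
def score(tp, fp, n):
    # (|TP| - |FP|) / |instances|, the metric from the question.
    return (tp - fp) / n

# 100 people: 60 men ("positive"), 40 women ("negative").
print(score(60, 40, 100))  # label everyone "man": (60-40)/100 = 0.2
print(score(60, 0, 100))   # perfect classifier: 60/100 = 0.6
print(score(0, 0, 100))    # label no one "man": 0.0
```

So on a perfectly separable sample, the perfect classifier does beat the all-positive one; the answer's objection concerns the case where labels are only predictable up to a conditional prevalence, not the case of a perfect model.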