"G-mean" in itself does not refer to something other than the result of: $g=\sqrt{x\cdot y}$ when dealing with two variables $x$ and $y$. Therefore, unless formally defined I would be careful to interpreter what a particular author refers at.
That said, imbalanced-learn's `geometric_mean_score()` does the right calculation based on the reference it uses. Kubat & Matwin (1997), *Addressing the curse of imbalanced training sets: one-sided selection*, define the geometric mean $g$ based on the "accuracy on positive examples" and the "accuracy on negative examples", which correspond respectively to the metrics Sensitivity (True Positive Rate - TPR) and Specificity (True Negative Rate - TNR). Therefore, the `geometric_mean_score()` function is correct; it reproduces the methodology presented in the references it cites.
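To make this concrete, here is a minimal sketch (on a hypothetical `y_true`/`y_pred` pair, with class 1 taken as "Positive") showing that `geometric_mean_score()` should match the Kubat & Matwin definition, i.e. the square root of Sensitivity times Specificity:

```python
import numpy as np
from sklearn.metrics import recall_score
from imblearn.metrics import geometric_mean_score

# Hypothetical imbalanced binary example; class 1 is the "Positive" class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # accuracy on positive examples (TPR)
specificity = recall_score(y_true, y_pred, pos_label=0)  # accuracy on negative examples (TNR)

print(np.sqrt(sensitivity * specificity))    # Kubat & Matwin's g
print(geometric_mean_score(y_true, y_pred))  # should give the same value
```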
Sensitivity and Specificity are informative metrics on how likely we are to detect instances of the Positive and Negative class respectively in our hold-out test sample. In that sense, Specificity is essentially our Sensitivity for detecting Negative class examples. This becomes more apparent when looking at the multi-class version of the G-mean, where we compute the $n$-th root of the product of the per-class Sensitivities. In the case where $n=2$, assuming we have classes `A` and `B` with class `A` as the "Positive" one and class `B` as the "Negative" one, the Sensitivity of class `B` is just the Specificity of the binary classification. In the case where $n>2$, we cannot refer to a "Positive" and a "Negative" class (outside the context of one-vs-rest classification), so we just use the product of the per-class Sensitivity scores, i.e. $\sqrt[n]{x_1 \cdot x_2 \cdot \dots \cdot x_n}$, where $x_i$ refers to the Recall score of the $i$-th class.
Let me stress that Sensitivity and Specificity are metrics that dichotomise our output and should, in the first instance, be avoided when optimising classifier performance. A more detailed discussion of why metrics like Sensitivity and Accuracy, which inherently dichotomise our outputs, are often suboptimal can be found here: Why is accuracy not the best measure for assessing classification models?
Further commentary: I think some of the confusion about how this "G-mean" is defined stems from the fact that the $F_1$ score is defined in terms of Precision (Positive Predictive Value - PPV) and Recall (TPR) and is the harmonic mean ($h = \frac{2 \cdot x \cdot y}{x+y}$) of the two. Some people might use the geometric mean $g$ instead of the harmonic mean $h$, thinking it is just another reformulation, without realising that they are redefining an existing metric. Please note that the geometric mean of Precision and Recall is not inherently wrong; it is just not what F-scores refer to, nor what the papers cited by imbalanced-learn use.
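To illustrate that these are three different quantities, a quick sketch on the same hypothetical binary data as above: the $F_1$ score (harmonic mean of Precision and Recall), the geometric mean of Precision and Recall, and imbalanced-learn's G-mean generally all come out different:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from imblearn.metrics import geometric_mean_score

# Same hypothetical binary example as above
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

prec = precision_score(y_true, y_pred)  # PPV
rec = recall_score(y_true, y_pred)      # TPR / Sensitivity

print(f1_score(y_true, y_pred))              # harmonic mean of Precision and Recall
print(np.sqrt(prec * rec))                   # geometric mean of Precision and Recall
print(geometric_mean_score(y_true, y_pred))  # geometric mean of Sensitivity and Specificity
```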