"G-mean" in itself does not refer to something other than the result of: $g=\sqrt{x\cdot y}$ when dealing with two variables $x$ and $y$. Therefore, unless formally defined I would be careful to interpreter what a particular author refers at.
That said, imbalanced-learn's `geometric_mean_score()` does the right calculation based on the reference it uses. Kubat & Matwin (1997), *Addressing the curse of imbalanced training sets: one-sided selection*, define the geometric mean $g$ based on the "accuracy on positive examples" and the "accuracy on negative examples", which correspond respectively to the metrics Sensitivity (True Positive Rate - TPR) and Specificity (True Negative Rate - TNR). Therefore, the `geometric_mean_score()` function is correct; it reproduces the methodology presented in the references it cites.
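To make this concrete, here is a minimal sketch (on a hypothetical `y_true`/`y_pred` pair, with class 1 taken as "Positive") showing that `geometric_mean_score()` should match the Kubat & Matwin definition, i.e. the square root of Sensitivity times Specificity:

```python
import numpy as np
from sklearn.metrics import recall_score
from imblearn.metrics import geometric_mean_score

# Hypothetical imbalanced binary example; class 1 is the "Positive" class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # accuracy on positive examples (TPR)
specificity = recall_score(y_true, y_pred, pos_label=0)  # accuracy on negative examples (TNR)

print(np.sqrt(sensitivity * specificity))    # Kubat & Matwin's g
print(geometric_mean_score(y_true, y_pred))  # should give the same value
```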
Sensitivity and Specificity are informative metrics on how likely we are to detect instances of the Positive and Negative class respectively in our hold-out test sample. In that sense, Specificity is essentially our Sensitivity for detecting Negative class examples. This becomes more apparent when looking at the multi-class version of the G-mean, where we compute the $n$-th root of the product of the per-class Sensitivities. In the case where $n=2$, assuming we have classes `A` and `B` with class `A` as the "Positive" one and class `B` as the "Negative" one, the Sensitivity of class `B` is just the Specificity of the binary classification. In the case where $n>2$, we cannot refer to a "Positive" and a "Negative" class (outside the context of one-vs-rest classification), so we just use the product of the per-class Sensitivity scores, i.e. $\sqrt[n]{x_1 \cdot x_2 \cdot \dots \cdot x_n}$, where $x_i$ refers to the Recall score of the $i$-th class.
Let me stress that Sensitivity and Specificity are metrics that dichotomise our output and should, in the first instance, be avoided when optimising classifier performance. A more detailed discussion of why metrics like Sensitivity and Accuracy, which inherently dichotomise our outputs, are often suboptimal can be found here: Why is accuracy not the best measure for assessing classification models?
Further commentary: I think some of the confusion about how this "G-mean" is defined stems from the fact that the $F_1$ score is defined in terms of Precision (Positive Predictive Value - PPV) and Recall (TPR) and is the harmonic mean ($h = \frac{2 \cdot x \cdot y}{x+y}$) of the two. Some people might use the geometric mean $g$ instead of the harmonic mean $h$, thinking it is just another reformulation, without realising that they are redefining an existing metric. Please note that the geometric mean of Precision and Recall is not inherently wrong; it is just not what F-scores refer to, nor what the papers cited by imbalanced-learn use.
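To illustrate that these are three different quantities, a quick sketch on the same hypothetical binary data as above: the $F_1$ score (harmonic mean of Precision and Recall), the geometric mean of Precision and Recall, and imbalanced-learn's G-mean generally all come out different:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from imblearn.metrics import geometric_mean_score

# Same hypothetical binary example as above
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

prec = precision_score(y_true, y_pred)  # PPV
rec = recall_score(y_true, y_pred)      # TPR / Sensitivity

print(f1_score(y_true, y_pred))              # harmonic mean of Precision and Recall
print(np.sqrt(prec * rec))                   # geometric mean of Precision and Recall
print(geometric_mean_score(y_true, y_pred))  # geometric mean of Sensitivity and Specificity
```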