
I am trying to understand the relative behavior of the following rank correlation statistics:

  1. Spearman coefficient
  2. Kendall Tau / Concordance percentage
  3. Normalized Gini coefficient (area under curve of percentage captured versus percentage observations)
  4. Normalized Area under ROC curve (for binary classifiers)

I don't believe any of these are functionally related to the others. The accepted answer here references this paper, which shows that Spearman and Kendall are highly correlated (as one would expect).

Are there good intuitions behind/discussions of relative (across datasets) or absolute (for a given dataset) differences for (any pair of) these measures?

cohoz
  • As far as I'm aware, the ROC curve is not a rank correlation statistic, and I don't believe the Gini coefficient is either, since it is a measure of statistical dispersion. Perhaps I'm wrong or have misinterpreted. – analystic Oct 01 '12 at 04:46
  • Sorry, I did a poor job of defining what I mean: rank by the prediction, not by the target. The Gini statistic (and similarly ROC) is then intended in the same sense as the term is used [here](http://www.kaggle.com/c/ClaimPredictionChallenge/details/Evaluation). This is a somewhat standard usage (see, for example, this [random paper (pdf)](http://www.jcic.org.tw/publish/2005Q110.pdf) from near the top of Google's results). – cohoz Oct 02 '12 at 01:44
  • I think you're misunderstanding two uses of the term "rank". Rho and Tau measure a monotonic relationship between two variables. The other two metrics are used for determining the effectiveness of a statistical model. In this sense, they may "rank" models according to their predictive power and provide information about the models; however, they are not measures of rank correlation. – analystic Oct 02 '12 at 11:43
  • Can you explain further? I am trying to see how well a predicted score ranks a target variable. The Gini is one way to do this. Why isn't Spearman's rho another? – cohoz Oct 03 '12 at 01:20
  • How does a "predicted score rank a target variable"? I don't know what you mean by this. You have target variables and you have models which predict target variables; they don't rank them. However, the Gini index is not a model and doesn't predict a dependent variable, as I understand its use. – analystic Oct 05 '12 at 10:07
  • To clarify, the correlation describes a relationship between variables, but the ROC and Gini describe the predictive power of a model. – analystic Oct 06 '12 at 03:04
  • I'm not sure I see the distinction. A Gini of 1 for a binary classification task means that the model scored every 1 higher than every 0 in the validation dataset. The relationship between the vector of scores and the target vector (as measured by the correlation statistic) is also 1. If we look at the dataset of pairs (predictions, target) and rank by the predicted score, we only need the rank ordering of the target when calculating the Gini or ROC. That's the sense in which I mean that we are interested in seeing how well the score orders ("ranks") the target. – cohoz Oct 06 '12 at 04:14
  • ROC area is identical to the concordance probability, and various rank correlation indexes (Kendall tau, Somers' D) are derived from this. – Frank Harrell Aug 19 '13 at 18:31
  • Would anyone happen to have an appropriate citation for the normalized Gini coefficient? – kdoherty Aug 31 '21 at 16:19
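
The link between ROC area and concordance mentioned in the comments can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy data are made up, and the Gini line assumes the common convention Gini = 2·AUC − 1 for a binary target, which may differ from the exact normalization the question has in mind.

```python
from itertools import product


def concordance_auc(scores, labels):
    """AUC computed as the concordance probability:
    P(score of a random positive > score of a random negative),
    with tied scores counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p, q in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    # Assumed convention for a binary target: Gini = 2 * AUC - 1.
    return 2.0 * auc - 1.0


labels = [1, 1, 0, 1, 0, 0]

# Every positive is scored above every negative.
perfect_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
auc_perfect = concordance_auc(perfect_scores, labels)
gini_perfect = gini_from_auc(auc_perfect)

# One positive (0.3) now falls below one negative (0.35):
# 8 of the 9 positive/negative pairs remain concordant.
swapped_scores = [0.9, 0.3, 0.35, 0.6, 0.2, 0.1]
auc_swapped = concordance_auc(swapped_scores, labels)
gini_swapped = gini_from_auc(auc_swapped)

print(auc_perfect, gini_perfect)
print(auc_swapped, gini_swapped)
```

Only the ordering of the scores relative to the labels enters the computation, which is the sense in which these model-evaluation metrics behave like rank statistics.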

1 Answer


The proposed question is rather complicated. As analystic already pointed out, I don't think all these measures can be compared straightforwardly, because rank correlation coefficients, Gini coefficient, and AUC (area under ROC curve) are generally defined on different domains.

However, there is a very close relation between Kendall's $\tau$ and Spearman's $\rho$, the two rank correlation coefficients in the list. While the paper cohoz mentioned demonstrates their relation empirically (Figure 3), the relation can also be quantified theoretically. Let $\pi$ and $\sigma$ be two rankings of $n$ items, and let $\pi(i)$ and $\sigma(i)$ be the ranks of item $i$ in $\pi$ and $\sigma$, respectively. The Kendall distance and Spearman distance between $\pi$ and $\sigma$ are defined as follows:

$$ K(\pi,\sigma) = \# \lbrace \; (i,j) \, \vert \, \pi(i)>\pi(j) \text{ and } \sigma(i)<\sigma(j) \; \rbrace $$

$$ S(\pi,\sigma) = \sum_i \left( \pi(i) - \sigma(i)\right)^2 $$

[Diaconis and Graham 1977] show that the Spearman footrule $F(\pi,\sigma) = \sum_i \lvert \pi(i) - \sigma(i) \rvert$ satisfies $K \le F \le 2K$. Since $F/\sqrt{n} \le \sqrt{S} \le F$ (by Cauchy-Schwarz, and because the Euclidean norm never exceeds the sum of absolute values), this gives the following relation between $K$ and $S$:

$$ \frac{1}{\sqrt{n}}K(\pi,\sigma) \le \sqrt{S(\pi,\sigma)} \le 2K(\pi,\sigma) $$

The bound applies to $\sqrt{S}$ rather than to $S$ itself: for the full reversal of $n=3$ items, $K=3$ while $S=8>2K$.

Because the rank correlation coefficients are just normalizations of the rank distances to the interval $[-1,1]$ (for complete rankings without ties, $\tau = 1 - \frac{4K}{n(n-1)}$ and $\rho = 1 - \frac{6S}{n(n^2-1)}$), similar inequalities can be easily derived between $\tau$ and $\rho$. In the statistical ranking literature, results are mostly stated in terms of distances rather than coefficients.
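
The two distances and their relation can be checked by brute force on small cases. The sketch below is illustrative only: the function names are made up, rankings are assumed to be complete and tie-free, and the bounds are tested in squared integer form ($K^2 \le nS \le 4nK^2$ split into $K^2 \le nS$ and $S \le 4K^2$), which is equivalent to bounding $\sqrt{S}$ between $K/\sqrt{n}$ and $2K$ while avoiding floating-point comparisons. It also checks the footrule bound $K \le F \le 2K$ of Diaconis and Graham, with $F = \sum_i \lvert \pi(i) - \sigma(i) \rvert$.

```python
from itertools import permutations


def kendall_distance(pi, sigma):
    """Number of item pairs that the two rankings order differently."""
    n = len(pi)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def footrule_distance(pi, sigma):
    """Spearman footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(pi, sigma))


def spearman_distance(pi, sigma):
    """Spearman distance: sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(pi, sigma))


def kendall_tau(pi, sigma):
    # Normalization of K to [-1, 1]; assumes no ties and n >= 2.
    n = len(pi)
    return 1 - 4 * kendall_distance(pi, sigma) / (n * (n - 1))


def spearman_rho(pi, sigma):
    # Normalization of S to [-1, 1]; assumes no ties and n >= 2.
    n = len(pi)
    return 1 - 6 * spearman_distance(pi, sigma) / (n * (n * n - 1))


def bounds_hold(n):
    """Check K <= F <= 2K and K^2 <= n*S, S <= 4*K^2 for all rankings of n items.

    Comparing every permutation against the identity is enough, because
    all three distances are unchanged when the items are relabelled.
    """
    identity = tuple(range(1, n + 1))
    for pi in permutations(identity):
        K = kendall_distance(pi, identity)
        F = footrule_distance(pi, identity)
        S = spearman_distance(pi, identity)
        if not (K <= F <= 2 * K):
            return False
        if not (K * K <= n * S and S <= 4 * K * K):
            return False
    return True


reversal, identity = (3, 2, 1), (1, 2, 3)
print(kendall_distance(reversal, identity), spearman_distance(reversal, identity))  # 3 8
print(kendall_tau(reversal, identity), spearman_rho(reversal, identity))  # -1.0 -1.0
print(all(bounds_hold(n) for n in range(2, 7)))
```

Exhaustive enumeration is only feasible for small $n$, so this is a sanity check of the inequalities rather than a substitute for the proof.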

Two more things:

  1. The rankings $\pi$ and $\sigma$ must be complete rankings in order to make this inequality hold. That is, they cannot be partial rankings.
  2. In case one is interested in $\tau$ and $\rho$ defined not only on rankings but on continuous random variables, the situation is more involved. Here is a related paper by Fredricks and Nelsen.
Weiwei