I have two non-normal populations of scores (A and B) and want to know the probability that a randomly selected score from A is greater than a randomly selected score from B. My plan was to estimate this probability by sampling many corresponding pairs from the two populations. However, I have to repeat this for many pairs of populations, so my question is whether sampling is the best option, or whether there is a more efficient approach, for example some kind of test comparing the two distributions that calculates the desired probability directly. I should also note that all values are integers, so ties are quite possible. Thanks.
-
Do you know the exact distributions of your populations? – WavesWashSands Mar 28 '18 at 10:23
-
Values are integers from 0 to 100 but beyond that won't have any specific distribution. – Terry B Mar 28 '18 at 20:05
2 Answers
If there had been no ties in your scores, the probability would have been equal to $\frac{U}{mn}$, where $U$ is the Wilcoxon-Mann-Whitney U statistic and $n$ and $m$ are the sample sizes of the two groups. Using some implementation of the WMW U test and extracting the test statistic might have been faster than randomly sampling many pairs, depending on what kind of accuracy you need for your estimates, and as a bonus you would have gotten a hypothesis test for the null hypothesis that $P(A > B) = P(A < B)$.
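As a rough sketch of this (my illustration, not part of the original answer), R's built-in `wilcox.test` returns this $U$ as its `statistic`; with no ties, dividing by $mn$ recovers the exact pairwise probability:

```r
# Sketch: with no ties, U / (m*n) equals P(A > B) computed over all pairs.
set.seed(1)
A <- sample(seq(1, 199, by = 2), 30)  # odd values only
B <- sample(seq(2, 200, by = 2), 40)  # even values only, so no A-B ties
U <- wilcox.test(A, B)$statistic      # Mann-Whitney U for A vs B
p_hat <- as.numeric(U) / (length(A) * length(B))
p_direct <- mean(outer(A, B, ">"))    # exhaustive comparison of all pairs
# p_hat and p_direct agree
```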
However, since your scores are integers in the relatively small range $0, \ldots, 100$, we can do even better: count the number of occurrences in each group of each possible value $C(A=i)$ and $C(B=i)$. Then we can efficiently calculate the number of scores in $B$ less than $i$ as
$$C(B < i) = C(B < i-1) + C(B = i-1)$$
and finally calculate the probability
$$P(A > B) = \frac{1}{mn}\sum_{i=0}^{100}C(A=i)C(B<i)$$
This algorithm is $O(n + m)$. The last two steps can be rolled into one loop in languages where this is efficient, but in R for example, it would probably be best to implement it like this:
function(A, B)
{
    cA <- tabulate(A + 1, 101)     # counts of values 0..100 in A
    cB <- tabulate(B + 1, 101)     # counts of values 0..100 in B
    CB <- cumsum(c(0, cB[1:100]))  # CB[i+1] = number of B scores < i
    sum(cA * CB) / (length(A) * length(B))
}
which, when dropping the local variables and exploiting a quirk of the tabulate function (it silently ignores zeros, which contribute nothing to the sum anyway), can be reduced to

function(A, B)
    sum(tabulate(A, 100) * cumsum(tabulate(B + 1, 100))) /
        (length(A) * length(B))
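To check this (my illustration, not part of the original answer; the function name `p_gt` is mine), the counting version can be compared against the $O(nm)$ brute-force pairwise calculation:

```r
# Sketch: the O(n + m) counting approach from the answer, wrapped as a
# function (the name p_gt is mine), checked against brute force.
p_gt <- function(A, B) {
    cA <- tabulate(A + 1, 101)
    cB <- tabulate(B + 1, 101)
    CB <- cumsum(c(0, cB[1:100]))
    sum(cA * CB) / (length(A) * length(B))
}
set.seed(42)
A <- sample(0:100, 500, replace = TRUE)
B <- sample(0:100, 700, replace = TRUE)
p_direct <- mean(outer(A, B, ">"))  # O(n*m) comparison of all pairs
# p_gt(A, B) and p_direct agree
```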

-
Thanks Eivind. I should have noted sooner but my data are integers 0:100, so ties are entirely possible. Does the above hold in cases with ties? – Terry B Mar 28 '18 at 23:12
-
No, unfortunately not. In general $U/mn$ is equal to $P(A> B) + 0.5P(A=B)$, also known as the Vargha-Delaney A statistic, which is a commonly used effect size in some fields, but, of course, not what you were looking for. – Eivind Samuelsen Mar 29 '18 at 06:12
-
Knowing that your observations are in a relatively small finite set opens up new possibilities, though, I will edit my answer. – Eivind Samuelsen Mar 29 '18 at 06:23
-
What you are describing sounds identical to the purpose of the Area Under the Curve for the Receiver Operating Characteristic (AUC-ROC or AUROC), which is closely related to the Wilcoxon rank-sum (Mann-Whitney U) test ($AUROC = U/{mn}$), where $m$ and $n$ are your sample sizes for $A$ and $B$ and $U$ is the U test statistic.
The AUROC is often defined along the lines of the probability that a randomly selected sample from one group is ranked higher than a randomly selected sample from the other.
A useful discussion is provided in an answer here: What does AUC stand for and what is it?
It is rank based and so is non-parametric.
For a useful discussion of the AUROC, see Hanley and McNeil, Radiology, 1982, Vol. 143, pp. 29-36.
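As a brief sketch (my illustration, not part of the original answer), R's `wilcox.test` gives this AUROC directly, and with ties it equals $P(A > B) + 0.5\,P(A = B)$ rather than $P(A > B)$ alone, which is the caveat raised in the comments above:

```r
# Sketch: with ties, U/(m*n) equals P(A > B) + 0.5 * P(A = B), the AUROC.
set.seed(7)
A <- sample(0:100, 50, replace = TRUE)
B <- sample(0:100, 60, replace = TRUE)
U <- wilcox.test(A, B, exact = FALSE)$statistic  # midranks handle ties
auroc <- as.numeric(U) / (length(A) * length(B))
tie_adjusted <- mean(outer(A, B, ">")) + 0.5 * mean(outer(A, B, "=="))
# auroc and tie_adjusted agree
```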
