
If k raters are asked to rate the same set of objects on a continuous or Likert scale, the intraclass correlation ICC(3) can be used to measure inter-rater agreement.
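For the continuous-scale case, a minimal sketch with the irr package might look as follows (the ratings and column names are made up purely for illustration):

library(irr)
# one row per object, one column per rater; made-up continuous ratings
ratings <- cbind(R1=c(4.1, 2.3, 3.0, 5.2), R2=c(3.9, 2.5, 3.2, 5.0), R3=c(4.3, 2.0, 2.9, 5.1))
# two-way model, consistency, single rater corresponds to ICC(3,1)
icc(ratings, model="twoway", type="consistency", unit="single")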

Is there also an agreement measure if all raters instead have to order the rated objects by preference?

A naive approach would be to compute the Spearman correlation for all pairs of raters and then take the average, but as this is most certainly a standard problem, I wonder whether there is a standard solution for it.
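For concreteness, a minimal sketch of this naive approach in R (the rankings are made up for illustration):

# one column per rater, one row per object; made-up example rankings
ranks <- cbind(R1 = c(1, 2, 3, 4), R2 = c(2, 1, 3, 4), R3 = c(1, 3, 2, 4))
rho <- cor(ranks, method = "spearman")   # Spearman correlation for every pair of raters
mean(rho[upper.tri(rho)])                # naive agreement index: average over all pairs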

  • [Paired preference models](https://stats.stackexchange.com/a/10741/930), for instance, or [log-linear approaches](https://stats.stackexchange.com/a/11720/930). – chl Nov 21 '20 at 15:14
  • @chl These are models that yield score values from ranking data. This is, however, not an issue in my case, because *all* raters rate *all* objects completely, and thus ML parameter estimation is not necessary. I am looking for an index that measures how well the raters agree. – cdalitz Nov 21 '20 at 15:52
  • Something like the coefficient of concordance [Kendall's W](https://en.wikipedia.org/wiki/Kendall%27s_W), then? – chl Nov 21 '20 at 18:40
  • @chl Yes, thanks! Kendall's W is exactly what I was looking for. Interestingly, according to the Wikipedia article it is almost the same as my suggestion of computing the average Spearman correlation between all pairs. – cdalitz Nov 21 '20 at 20:02
  • @chl I have tried Kendall's W, but the result does not look very reasonable in various test cases (see my answer below). Do you know any other indices which I might try out? – cdalitz Nov 26 '20 at 10:31

1 Answer


Following @chl's suggestion, I have tried Kendall's W, but the result is somewhat surprising. Although there is 80% perfect agreement among the raters (four of the five give identical rankings), Kendall's W is only 0.36:

> library(irr)
> x <- data.frame(R1=c(1,2,3), R2=c(1,2,3), R3=c(1,2,3), R4=c(3,2,1), R5=c(1,2,3))
> x
  R1 R2 R3 R4 R5
1  1  1  1  3  1
2  2  2  2  2  2
3  3  3  3  1  3
> kendall(x)$value
[1] 0.36

Does someone know of a different index that yields a more reasonable result in this case?

  • The problem comes from the fact that you're using a 3-point rating scale, which implies there's little variation around the average ratings. The same applies in the case of the ICC for agreement. I can increase your Kendall's W by simply using a larger range of responses, e.g., `x ...` – chl Nov 26 '20 at 11:15
  • Hm, as these are ranks, gaps in the responses are not possible. Maybe Kendall's W is not appropriate for rankings, but only for Likert scales? – cdalitz Nov 26 '20 at 12:12
  • The result isn't unreasonable, and Kendall's W is appropriate. The Spearman correlation between `c(1, 2, 3)` and `c(3, 2, 1)` is -1. Out of all the `choose(5, 2)` = 10 Spearman correlations between the pairs of ranks, 4 of them involving your 4th rater are -1, and the remaining 6 pairs of ranks are correlated 1 with each other. We therefore find a mean Spearman correlation of .2. Kendall's W is linearly related to the average Spearman correlation over all pairs of ranks: given $m$ judges (here, 5), we have $\bar{\rho} = (mW - 1)/(m - 1) = (5 \cdot 0.36 - 1)/(5 - 1) = 0.2$ (see the numerical check after these comments). – awhug Nov 26 '20 at 12:58
  • @awhug This means that the "unexpected" result can be traced back to the fact that a single dissenting rater is overrepresented in all pairs for small number of raters. I will ponder ways to circumvent this problem (I am sure, though, that someone already came up with an index that would yield 0.8 in this case). – cdalitz Nov 26 '20 at 13:05
  • Yeah, I think the sample size is partly the issue here. One rater disagreeing on the appropriate ranking of every single object compared to the other four has a big impact. It's a little hard to see how a value of 0.8 could arise - there isn't really 80% agreement among the (pairs of) raters, but rather 80% of the ranks themselves are identical. You could instead consider the correlation between the raters' ranks and a modal/criterion ranking, but this is a somewhat different question, and even then in this case would only yield an average Spearman correlation of .6. – awhug Nov 26 '20 at 13:34
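A quick numerical check of the relation pointed out by @awhug, using the data from the answer above (a sketch, assuming the irr package is installed):

library(irr)
x <- data.frame(R1=c(1,2,3), R2=c(1,2,3), R3=c(1,2,3), R4=c(3,2,1), R5=c(1,2,3))
W <- kendall(x)$value                 # Kendall's W: 0.36
rho <- cor(x, method="spearman")      # pairwise Spearman correlations between raters
mean(rho[upper.tri(rho)])             # average over all 10 pairs: 0.2
(ncol(x) * W - 1) / (ncol(x) - 1)     # (m*W - 1)/(m - 1): also 0.2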