Which measure of inter-rater agreement for multi-class rating

Question

In our experiment subjects could choose up to three values from a list of 18 categorical values.

So let's say that the classes are animals: cat, dog, mouse, sheep... And that the subjects were asked to identify animals in 5 pictures. Their results would look like:

S1: sheep; cat and dog; dog and sheep; dog; dog

S2: sheep; cat; cat and sheep; dog and cat; cat

Now I have some issues to find a good measure for inter-rater agreement.

One easy measure I calculated is to determine the union (at least one animal has been found in the picture by both raters) and the intersection (raters found exactly the same animals in the picture). The problem is that they are not statistical measures.

The best solution I found, is to calculate Cohen's kappa for each of the 18 values, for each couple of raters.

So:

sheep matrix: 1 (agreement: a sheep in the picture); 1 (disagreement); 1 (disagreement); 3 (agreement: no sheeps in the picture)

dog matrix: 1 (agreement: a dog in the picture); 3 (disagreement); 0 (disagreement); 1 (agreement: no dog in the picture)

I find these very hard to interpret, especially because in the real data the fourth case (agreement on the absence of the animal in the picture) is very often a very high number (some animals were found in about 1 picture out of 100). It appears to me that this makes the kappa value not appropriate for this application - but please correct me if I am wrong.

I am looking for a better measure, possibly an aggregated measure for all the classes (and not one by one).

Firebug · Accepted Answer · 2016-07-18T12:50:09.920

You calculate a single Kappa for all the categories at once. The formula for $\kappa$ is:

$$\kappa = \frac{p_{o}-p_{e}}{1-p_{e}} = \frac{N_{o}-N_{e}}{N-N_{e}}$$

You see, this depends only on $p_{o}$ and $p_{e}$, respectively the observed agreement and the chance agreement. The number of categories does not matter at all.

Say you have a table of observations $\text m_{i,j}$ with $k$ categories.

$\sum_i^k{m_{i,i}}$ is the sum of agreements in the table.
$\sum_j^k{m_{l,j}}$ is the sum of counts in the $l$-th row.
$\sum_i^k{m_{i,l}}$ is the sum of counts in the $l$-th column.

So, when you multiply the total counts in a row by the total counts in the respective column and divide by $N$ you have a conservative estimate of the count of agreements by chance in that category. Summing over all pairs of rows and columns of each category you have an overall estimate of chance agreement, and you can plug that into the $\kappa$ formula.

$$N_o = \sum_i^k{m_{i,i}}$$

$$N_e = \frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)$$

$$\therefore \kappa = \frac{N_{o}-N_{e}}{N-N_{e}} = \frac{\sum_i^k{m_{i,i}}-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}{N-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}$$

And here's a sample code in R:

kappa = function(M) (sum(diag(M)) - 1/sum(M) * sum(colSums(M) * rowSums(M)))/
(sum(M) - 1/sum(M) * sum(colSums(M) * rowSums(M)))

In case of perfect agreement:

> M = table(iris$Species, iris$Species)
> print(kappa(M))
[1] 1

In case of random predictions

> M = table(sample(iris$Species, 150L), iris$Species)
> print(kappa(M))
[1] -0.09

One should also be wary of the impact an unbalanced sample set can have. Looking at the margins of a confusion matrix, as here http://stats.stackexchange.com/questions/82162/kappa-statistic-in-plain-english (see the Interpretation section) can be very helpful in seeing where concordance and discordance are occurring. — Ashe, Jul 18 '16 at 16:27

Which measure of inter-rater agreement for multi-class rating

1 Answers1