In our experiment, subjects could choose up to three values from a list of 18 categorical values.
Let's say the classes are animals: cat, dog, mouse, sheep... and that the subjects were asked to identify the animals in 5 pictures. Their results would look like:
S1: sheep; cat and dog; dog and sheep; dog; dog
S2: sheep; cat; cat and sheep; dog and cat; cat
Now I am having trouble finding a good measure of inter-rater agreement.
One easy measure I calculated is the proportion of pictures where the raters' answers overlap (at least one animal found in the picture by both raters) and the proportion where they match exactly (the raters found exactly the same animals in the picture). The problem is that these are just raw proportions, not chance-corrected statistical measures.
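For reference, this is roughly what I computed on the toy data (a minimal sketch; the set encoding is just one way to represent the answers):

```python
# Toy data from the example above, encoded as one set of animals per picture.
s1 = [{"sheep"}, {"cat", "dog"}, {"dog", "sheep"}, {"dog"}, {"dog"}]
s2 = [{"sheep"}, {"cat"}, {"cat", "sheep"}, {"dog", "cat"}, {"cat"}]

n_pictures = len(s1)
overlap = sum(1 for a, b in zip(s1, s2) if a & b)   # at least one animal in common
exact = sum(1 for a, b in zip(s1, s2) if a == b)    # exactly the same animals

print(overlap / n_pictures)   # 0.8
print(exact / n_pictures)     # 0.2
```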
The best solution I found is to calculate Cohen's kappa for each of the 18 values, for each pair of raters.
So:
sheep matrix:
2 (agreement: a sheep in the picture);
0 (disagreement: only S1 marked a sheep);
0 (disagreement: only S2 marked a sheep);
3 (agreement: no sheep in the picture)
dog matrix:
1 (agreement: a dog in the picture);
3 (disagreement: only S1 marked a dog);
0 (disagreement: only S2 marked a dog);
1 (agreement: no dog in the picture)
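Concretely, something like this sketch is what I do for each class and each pair of raters (shown on the toy data only; the helper names are my own, not from any particular package):

```python
from collections import Counter

s1 = [{"sheep"}, {"cat", "dog"}, {"dog", "sheep"}, {"dog"}, {"dog"}]
s2 = [{"sheep"}, {"cat"}, {"cat", "sheep"}, {"dog", "cat"}, {"cat"}]
classes = ["cat", "dog", "sheep"]   # 18 classes in the real data

def kappa_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 table: a = both present, b = only rater 1,
    c = only rater 2, d = both absent."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    # undefined when a class is never (or always) marked by both raters
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")

for cls in classes:
    cells = Counter((cls in x, cls in y) for x, y in zip(s1, s2))
    a, b = cells[(True, True)], cells[(True, False)]
    c, d = cells[(False, True)], cells[(False, False)]
    print(cls, (a, b, c, d), round(kappa_2x2(a, b, c, d), 2))
# cat   (2, 0, 2, 1)  0.29
# dog   (1, 3, 0, 1)  0.12
# sheep (2, 0, 0, 3)  1.0
```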
I find these very hard to interpret, especially because in the real data the fourth cell (agreement on the absence of the animal in the picture) is very often a very large number: some animals were found in only about 1 picture out of 100. It appears to me that this makes kappa inappropriate for this application, but please correct me if I am wrong.
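To make the concern concrete with purely invented numbers: take a rare class that each rater marks once, in different pictures, out of 100. Observed agreement is 98%, yet kappa comes out essentially zero:

```python
# Hypothetical rare class (numbers invented for illustration only):
# each rater marks the animal in 1 of 100 pictures, but not the same one.
a, b, c, d = 0, 1, 1, 98   # both, only rater 1, only rater 2, neither
n = a + b + c + d
p_o = (a + d) / n                                          # observed agreement: 0.98
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2     # expected by chance: 0.9802
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, round(kappa, 3))   # 0.98 -0.01
```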
I am looking for a better measure, ideally a single aggregated measure over all 18 classes rather than one per class.