
Consider a dataset where each instance was annotated by a number of participants who had to categorize the instance (a, b, or c). To calculate some form of agreement between these participants, we could calculate the average entropy over all instances. My question is: is that still a fair thing to do when not all examples have been annotated by the same number of participants? Say that example_1 has been annotated by three participants and example_2 by five participants, is it then still fair to compare the entropy values between these two, or to take the unweighted average?

Example dataset:

annotator_1 | annotator_2 | annotator_3 | annotator_4 | annotator_5
-------------------------------------------------------------------
a           | b           | a           | a           | c
a           | a           | b           |             |
a           | b           | a           | c           |
b           | b           | a           | c           | a
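To make the computation concrete, here is a minimal Python sketch of what I have in mind: the plug-in entropy of each instance (in nats) and the unweighted average over instances. All names are illustrative.

from collections import Counter
from math import log

# Annotations per instance; rows may have different numbers of annotators.
instances = [
    ["a", "b", "a", "a", "c"],
    ["a", "a", "b"],
    ["a", "b", "a", "c"],
    ["b", "b", "a", "c", "a"],
]

def entropy(labels):
    """Plug-in (maximum-likelihood) entropy in nats of a list of labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

per_instance = [entropy(row) for row in instances]
average = sum(per_instance) / len(per_instance)  # unweighted average over instances
print(per_instance, average)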
Bram Vanroy

2 Answers


Yes it is, if you read the missing annotations in a particular way.

Entropy is computed from an estimated probability distribution. There is no way to know what the choice probabilities would have been for annotators who did not label an instance. So you can assume that the unobserved distribution behind each missing annotation matches the observed one (the categories chosen by the other annotators).

This solution is natural if you think, as you may well do, that an annotator who didn't pick any category was willing to leave the choice to the others and accept their decision. In that case it makes complete sense to compute entropy simply by discarding the missing values; they carry no information.

What if you don't see it this way?

In statistics, the best solution to a problem really depends on how you read the data. Missing values, more than anything else, can be read in different ways.

You may think that the trivial method above biases the estimates, because a missing value expresses doubt, not tacit agreement with the other annotators' choice, whatever it was. In this case you could add a kind of Bayesian prior giving equal chances to each category: just add three columns to your dataset, one per category, each filled with that category's label (one virtual "a" vote, one virtual "b" vote, and one virtual "c" vote per instance). The computed entropies will all be higher, but the effect is strongest on the instances with more missing values, while those with few missing values will have their entropy raised less. The instances may therefore be ranked differently by agreement, with a preference for instances with fewer missing categorizations.

Example:

annotator_1 | annotator_2 | annotator_3 | annotator_4 | annotator_5 | simple entropy | adjusted entropy
-------------------------------------------------------------------------------------------------------
a           | a           | a           | a           | c           |      0.5       |       0.9
a           | a           |             |             |             |      0         |       0.95

Here the second row expresses better agreement if you simply drop the missing values, but worse agreement once you add the "uncertainty prior", i.e. the three virtual annotations, one per category.
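If it helps, here is a small Python sketch reproducing the table above. The entropies are in natural-log units (nats), which is what matches the numbers shown; all names are illustrative.

from collections import Counter
from math import log

def entropy(labels):
    """Plug-in entropy in nats of a list of labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

rows = [["a", "a", "a", "a", "c"],
        ["a", "a"]]
virtual = ["a", "b", "c"]  # the "uncertainty prior": one virtual vote per category

for row in rows:
    simple = entropy(row)             # missing values simply dropped
    adjusted = entropy(row + virtual)
    print(round(simple, 2), round(adjusted, 2))
# prints: 0.5 0.9
#         0.0 0.95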

Which of these two solutions is better depends only on your take on the data, unless this is part of some model whose performance you can assess on a hold-out set. I don't think the choice will change much anyway.

carlo

I recommend a measure other than entropy.

The plug-in estimate of entropy, based on the observed proportions over a set of classes, has an inherent downward bias that depends both on the number of classes and on the number of observations. Answers and comments on this thread discuss that problem and provide some links for further reading. So the magnitude of the bias in your estimates will differ when the numbers of observations differ, which I don't think is what you want.
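A quick simulation makes the bias visible. Here is an illustrative Python sketch under a hypothetical setup: each annotation is drawn uniformly at random from the three categories, so the true entropy is ln 3 ≈ 1.099 nats, and any shortfall in the mean plug-in estimate is bias.

import random
from collections import Counter
from math import log

random.seed(0)

def plugin_entropy(labels):
    """Plug-in entropy in nats of a list of labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

true_entropy = log(3)  # uniform choice over three categories

# Smaller samples give a larger downward bias in the plug-in estimate.
for n in (3, 5, 20, 100):
    sims = [plugin_entropy(random.choices("abc", k=n)) for _ in range(10_000)]
    mean = sum(sims) / len(sims)
    print(f"n={n:>3}  mean plug-in entropy={mean:.3f}  bias={mean - true_entropy:+.3f}")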

EdM