I am doing research on music emotion recognition. In my research, I use a two-dimensional (Valence and Arousal) space to represent the emotion of a song. To validate my approach, I have to create a ground truth of music emotion. Most of the literature I have read hired about 5 people to annotate the emotion of each song in the list of songs in question. I wonder whether there is a study that examined or specified the required number of annotators for a music emotion recognition task? I have searched for such a study but I cannot find it. Thank you for all your answers!
1 Answer
There are two questions to consider when choosing the number of annotators:
- How accurate they are
- How much they disagree with each other
For example, if all annotators give similar answers, one will be enough. In most real-world situations, most human annotators are quite accurate (but not perfect) and tend to agree with each other, but not always.
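One simple way to quantify that disagreement is the fraction of annotator pairs that agree on each item, averaged over items. A minimal sketch (the data layout and function name are illustrative, not from any library):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """labels[i][j] = annotator j's label for item i.
    Returns the fraction of annotator pairs that agree, averaged over items."""
    agree = total = 0
    for row in labels:
        for a, b in combinations(row, 2):
            agree += (a == b)
            total += 1
    return agree / total

# Two items, three annotators: perfect agreement on item 0,
# one dissenter on item 1.
print(pairwise_agreement([[1, 1, 1], [0, 0, 1]]))  # 4 of 6 pairs agree
```

If this number is close to 1, a small annotator pool may already be sufficient; if it is low, you need more annotators (or better guidelines) before aggregation can help.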
In such cases you can aggregate the results of the different annotators in order to improve your labels. A classical way to do so is to use a Dawid-Skene estimator ("Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm").
In this method we model the performance of each annotator explicitly. Note that a plain majority vote might not be good enough, since a single very accurate annotator can be outvoted by many bad ones. Given the estimated annotator performance you estimate the labels, and then re-estimate the annotator performance from those labels, alternating until convergence. This technique is called Expectation-Maximization (EM).
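The alternation described above can be sketched as follows for binary labels. This is a minimal toy version of the Dawid-Skene idea, not a faithful reimplementation of the original paper: it tracks only per-class annotator accuracy with Laplace smoothing, and all names and data are illustrative.

```python
def dawid_skene(labels, n_iter=20):
    """labels[i][j] = annotator j's binary label (0/1) for item i.
    Returns an estimated true label per item."""
    n_items = len(labels)
    n_annot = len(labels[0])
    # Initialize the posterior P(true label = 1) with the majority-vote fraction.
    p1 = [sum(row) / n_annot for row in labels]
    for _ in range(n_iter):
        # M-step: estimate each annotator's accuracy on each true class,
        # weighting items by the current posterior (with Laplace smoothing).
        acc = []  # acc[j] = (P(label=0 | true=0), P(label=1 | true=1))
        for j in range(n_annot):
            c0 = sum((1 - p1[i]) * (1 - labels[i][j]) for i in range(n_items))
            t0 = sum(1 - p1[i] for i in range(n_items))
            c1 = sum(p1[i] * labels[i][j] for i in range(n_items))
            t1 = sum(p1[i] for i in range(n_items))
            acc.append(((c0 + 1) / (t0 + 2), (c1 + 1) / (t1 + 2)))
        # E-step: recompute the label posteriors given annotator accuracies.
        prior1 = sum(p1) / n_items
        for i in range(n_items):
            like0, like1 = 1 - prior1, prior1
            for j in range(n_annot):
                a0, a1 = acc[j]
                if labels[i][j] == 1:
                    like0 *= 1 - a0
                    like1 *= a1
                else:
                    like0 *= a0
                    like1 *= 1 - a1
            p1[i] = like1 / (like0 + like1)
    return [1 if p > 0.5 else 0 for p in p1]

# Four songs, three annotators (e.g. high/low valence coded as 1/0).
votes = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(dawid_skene(votes))
```

With mostly reliable annotators the estimates stay close to the majority vote; the method pays off when annotators differ sharply in reliability.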
Please note that in many cases this technique will be overkill. Instead, estimate the performance of each annotator with respect to some gold standard. After that, give the majority rule a try. If the majority-vote results satisfy you, you are good to go. You might even find that 3 annotators will be enough.
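The simpler baseline suggested here fits in a few lines. A minimal sketch, assuming binary labels and a small gold-standard set (the function names and data are hypothetical):

```python
from collections import Counter

def annotator_accuracy(annotations, gold):
    """annotations[j][i] = annotator j's label for gold-standard item i."""
    return [sum(a == g for a, g in zip(ann, gold)) / len(gold)
            for ann in annotations]

def majority_vote(labels):
    """labels[i] = the list of labels all annotators gave item i.
    Use an odd number of annotators to avoid ties."""
    return [Counter(row).most_common(1)[0][0] for row in labels]

# Check each annotator against 4 gold items, then aggregate new items.
gold = [1, 0, 1, 0]
per_annotator = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
print(annotator_accuracy(per_annotator, gold))  # one score per annotator
print(majority_vote([[1, 1, 0], [0, 0, 0]]))
```

If the gold-standard accuracies are all high and similar, the majority vote is hard to beat and the EM machinery adds little.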
