I have a data analysis puzzle involving ROC curves that I hoped you could help me with.
One of my research projects explores how to use crowds for idea filtering (i.e., distinguishing good ideas from bad ones). In our recent experiments, we asked 50 people to rate each of 8 ideas. Three of these ideas were identified by experts as good; the remaining five as bad. We tried several different rating schemes (e.g., rating ideas on a scale of 1-5, selecting only the 3 worst ideas, etc.).
For each rating scheme, we produced an aggregate score for each idea, defined as the sum of the scores the idea received from the 50 crowd members, e.g.:
rating_scheme   score(idea1)   score(idea2)   ...   score(ideaN)
Likert          71             68             ...   50
worst3          7              15             ...   11
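For concreteness, here is roughly how we compute those aggregates; the array name `ratings` and the random placeholder data below stand in for our actual ratings:

```python
import numpy as np

# Hypothetical ratings for one scheme (e.g. Likert): one row per rater,
# one column per idea -- 50 raters x 8 ideas. Placeholder data only.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(50, 8))  # placeholder 1-5 ratings

# Aggregate score per idea = sum of the 50 individual ratings
aggregate_scores = ratings.sum(axis=0)  # shape (8,)
```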
We can then of course use ROC analysis to assess the accuracy of each rating scheme, using the expert ratings as the gold standard.
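In code, the per-scheme AUC computation looks something like the sketch below (the scores are invented; for a scheme like worst3, where a higher score means a worse idea, we negate the scores first so that higher always means better):

```python
from sklearn.metrics import roc_auc_score

# Expert gold standard for the 8 ideas: 1 = good, 0 = bad (3 good, 5 bad)
expert_labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Aggregate crowd scores for the same 8 ideas (invented numbers)
likert_scores = [71, 68, 50, 45, 60, 40, 38, 55]

auc = roc_auc_score(expert_labels, likert_scores)
print(f"Likert AUC: {auc:.3f}")
```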
The puzzle involves determining whether there is a statistically significant difference between the accuracy of the different rating schemes.
We can of course use something like http://vassarstats.net/roc_comp.html to do this, but we don't know what to enter for the number of actually negative and actually positive cases. There are only 8 ideas (3 good and 5 bad), so perhaps those numbers should simply be 3 and 5. But that doesn't account for the fact that we had 50 raters in each condition. Surely the statistical significance of a given AUC difference between two conditions should be greater if we ran, in effect, 50 distinct tests per idea rather than just 1. If so, how do we set up the analysis to reflect that properly?
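To make the question concrete, one direction we've considered is a rater-level bootstrap: resample the 50 raters with replacement, recompute each scheme's aggregate scores and AUC on the resample, and look at the distribution of the AUC difference, so that rater-level variability enters the comparison. A sketch follows (the names `ratings_a`, `ratings_b`, and `labels` are hypothetical rater-by-idea matrices and expert labels; we're not sure this is the right setup, which is partly why we're asking):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(ratings_a, ratings_b, labels, n_boot=10_000, seed=0):
    """Resample raters with replacement; for each resample, recompute the
    aggregate scores and AUC of both schemes and record the AUC difference."""
    rng = np.random.default_rng(seed)
    n_raters = ratings_a.shape[0]  # 50 in our case
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_raters, size=n_raters)
        auc_a = roc_auc_score(labels, ratings_a[idx].sum(axis=0))
        auc_b = roc_auc_score(labels, ratings_b[idx].sum(axis=0))
        diffs[b] = auc_a - auc_b
    return diffs

# Example use (with 50 x 8 rating matrices for two schemes):
# diffs = bootstrap_auc_diff(likert_ratings, worst3_ratings, expert_labels)
# ci = np.percentile(diffs, [2.5, 97.5])  # does the 95% CI exclude 0?
```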
Did I explain the question clearly? Do you have any guidance on how we can resolve it? Any help would be greatly appreciated, because right now we are kind of stuck.