
I have a data analysis puzzle involving ROC curves that I hoped you could help me with.

One of my research projects involves exploring how to use crowds to do idea filtering (i.e. to distinguish good ideas from bad ones). In our recent experiments, we asked 50 people to provide ratings for each of 8 ideas. Three of these ideas were identified by experts as good and the remaining five as bad. We tried several different rating schemes (e.g. rating ideas on a scale of 1 to 5, selecting only the 3 worst ideas, etc.).

For each rating scheme, we produced an aggregate score for each idea, namely the sum of the scores the idea received from the 50 crowd members, e.g.:

rating scheme   score(idea1)   score(idea2)   ...   score(ideaN)
Likert          71             68             ...   50
worst3          7              15             ...   11
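
For concreteness, here is a minimal sketch of how such aggregate scores arise from a raters-by-ideas matrix. The data here are randomly generated placeholders, not the actual experimental ratings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data: one 50 x 8 matrix per rating scheme
# (50 raters, 8 ideas). Real values would come from the experiment.
likert = rng.integers(1, 6, size=(50, 8))   # ratings on a 1-5 scale
worst3 = np.zeros((50, 8), dtype=int)
for row in worst3:                          # each rater flags 3 ideas as worst
    row[rng.choice(8, size=3, replace=False)] = 1

# Aggregate score per idea = sum over the 50 raters
likert_scores = likert.sum(axis=0)          # 8 sums, one per idea
worst3_scores = worst3.sum(axis=0)
```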

We can then of course use ROC analysis to assess the accuracy of each rating scheme, using the expert ratings as the gold standard.
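
For a single scheme, computing the AUC is a one-liner, e.g. with scikit-learn (the score vectors below are invented placeholders; note that worst-3 counts must be negated, since a high count indicates a bad idea):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Expert gold standard: 1 = good idea, 0 = bad idea (3 good, 5 bad)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# Hypothetical aggregate crowd scores per idea for each scheme
likert_scores = np.array([71, 68, 50, 45, 40, 38, 35, 30])
worst3_scores = np.array([2, 4, 3, 12, 10, 9, 6, 4])

print(roc_auc_score(y_true, likert_scores))   # higher score = better idea
print(roc_auc_score(y_true, -worst3_scores))  # negated: high worst-3 count = bad
```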

The puzzle is determining whether there is a statistically significant difference in accuracy between the different rating schemes.

We can of course use something like http://vassarstats.net/roc_comp.html to do this. But we don't know what we should enter for the number of actually negative and actually positive cases. There are only 8 ideas, 3 good and 5 bad, so maybe those numbers should be 3 and 5. But this doesn't account for the fact that we had 50 raters in each condition. Surely the statistical significance of a given AUC difference between two conditions should be greater if we ran, in effect, 50 distinct tests per idea, instead of just 1. If so, how do we set up the analysis to reflect that properly?
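
Not an answer to the calculator question as such, but one standard way to make the 50 raters carry the uncertainty is a bootstrap over raters: resample the 50 raters with replacement, recompute each scheme's aggregate scores and AUC on every resample, and inspect the distribution of the AUC difference. A minimal sketch, assuming per-rater matrices like the hypothetical ones above, both oriented so that higher means better:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(ratings_a, ratings_b, y_true, n_boot=10_000, seed=0):
    """Bootstrap the AUC difference between two rating schemes over raters.

    ratings_a, ratings_b: (n_raters, n_ideas) per-rater score matrices,
    both oriented so that a higher score means a better idea.
    """
    rng = np.random.default_rng(seed)
    n_raters = ratings_a.shape[0]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_raters, size=n_raters)  # resample raters
        auc_a = roc_auc_score(y_true, ratings_a[idx].sum(axis=0))
        auc_b = roc_auc_score(y_true, ratings_b[idx].sum(axis=0))
        diffs[b] = auc_a - auc_b
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

# e.g. with the hypothetical matrices above (worst3 negated so higher = better):
# mean_diff, ci = bootstrap_auc_diff(likert, -worst3, np.array([1,1,1,0,0,0,0,0]))
```

Note that this treats the 8 ideas as fixed and only propagates rater variability; with just 3 positives and 5 negatives the AUC can take only a small number of distinct values, so any interval will be coarse.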

Did I explain the question clearly? Do you have any guidance on how we can resolve it? Any help would be greatly appreciated, because right now we are kind of stuck.

  • AUROC is the concordance probability (c-index), which is a linear translation of both Somers' $D_{xy}$ rank correlation between predicted and observed and the Wilcoxon test statistic. Differences in AUROCs correspond to differences in Wilcoxon statistics. We don't use differences in rank statistics to test things because this loses power. I suggest finding another approach. – Frank Harrell Sep 22 '21 at 11:21
  • Thanks, @FrankHarrell. Could you please suggest other approaches I could use to test this? – Lukgaf Sep 22 '21 at 12:23
  • To start, https://fharrell.com/post/addvalue – Frank Harrell Sep 22 '21 at 16:55
  • Closely related: https://stats.stackexchange.com/q/358101/22311 – Sycorax Sep 22 '21 at 17:19
  • Thanks. These are very useful. I will check them out. – Lukgaf Sep 27 '21 at 12:25
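
As a quick numerical check of the equivalence mentioned in the first comment: the AUC equals the Mann-Whitney $U$ statistic divided by $n_{pos} \cdot n_{neg}$, and Somers' $D_{xy} = 2 \cdot \mathrm{AUC} - 1$. A sketch with invented scores:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([71, 68, 41, 45, 40, 38, 35, 30])  # hypothetical

auc = roc_auc_score(y_true, scores)

# Mann-Whitney U comparing scores of good vs. bad ideas
u, _ = mannwhitneyu(scores[y_true == 1], scores[y_true == 0],
                    alternative="two-sided")
n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()

print(auc, u / (n_pos * n_neg))  # identical: AUC = U / (n_pos * n_neg)
print(2 * auc - 1)               # Somers' D_xy = 2 * AUC - 1
```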

0 Answers