Scenario: I am conducting a study in which I am comparing a ‘gold standard’ assessment protocol with an experimental assessment protocol for a health variable of interest, over a common set of study participants. Observations will be paired: for each participant in the study, I will have observations from both the ‘gold standard’ and experimental assessment protocols. Both protocols generate assessments on a common nominal categorical variable (taking the same three possible values, which I’ve coded as 0, 1, and 2 in the example below).
Analytic topic of interest: I would like to be able to meaningfully characterize the extent to which the results of the experimental assessment protocol correspond (i.e. agree) with those of the ‘gold standard’ assessment protocol.
Statistical question(s) of interest: What measure(s) should I consider using to characterize the level of agreement across the two protocols? Would a measure of inter-rater reliability, such as Cohen’s kappa, be appropriate here? Are there other measures that I should consider? Measures that can also be associated with a hypothesis test and p-value would be ideal. As a partial aside, colleagues have proposed that I use a chi-square test of independence between the two protocols’ observations and Cramér’s V to characterize the level of agreement, but I am concerned that these measure association rather than agreement, and that the paired nature of the observations would contraindicate them. Is that concern misguided?
I will have on the order of 1,000 study participants. My data are currently structured as illustrated in the following sample table, and I expect to run the analyses in R or Stata.
Suggestions (or pointers) greatly appreciated.
subject  Protocol1  Protocol2
1        2          2
2        0          0
3        2          1
4        0          1
5        2          1
6        1          1
7        1          0
8        0          0
9        0          1
10       1          2
11       2          1
12       2          0
13       2          0
14       2          2
15       0          1
16       0          0
17       0          2
18       2          0
19       1          1
20       0          0
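For concreteness, here is a minimal sketch of how unweighted Cohen’s kappa would be computed on the 20 sample rows above. It is written in plain Python purely for illustration; in practice one would presumably use an existing routine (e.g. kappa2 in R’s irr package, or kap in Stata, both of which also report a z-statistic and p-value).

```python
from collections import Counter

# (Protocol1, Protocol2) pairs transcribed from the sample table above.
pairs = [(2, 2), (0, 0), (2, 1), (0, 1), (2, 1), (1, 1), (1, 0), (0, 0),
         (0, 1), (1, 2), (2, 1), (2, 0), (2, 0), (2, 2), (0, 1), (0, 0),
         (0, 2), (2, 0), (1, 1), (0, 0)]

n = len(pairs)
categories = [0, 1, 2]

# Observed agreement: proportion of subjects on which the protocols agree.
p_o = sum(a == b for a, b in pairs) / n

# Expected agreement under independence, from the two marginal distributions.
m1 = Counter(a for a, _ in pairs)   # Protocol1 marginal counts
m2 = Counter(b for _, b in pairs)   # Protocol2 marginal counts
p_e = sum(m1[c] * m2[c] for c in categories) / n**2

# Cohen's kappa: chance-corrected agreement.
kappa = (p_o - p_e) / (1 - p_e)
print(f"observed = {p_o:.2f}, expected = {p_e:.2f}, kappa = {kappa:.3f}")
# -> observed = 0.40, expected = 0.32, kappa = 0.118
```

Note that kappa credits only the diagonal of the cross-tabulation (exact agreement), whereas a chi-square statistic responds to any departure from independence, which is part of why the two answer different questions.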