Participants were rated twice, with the 2 ratings separated by 3 years. For most participants the ratings were done by different raters, but for some (< 10%) the same rater performed both ratings. There were 8 raters altogether, with 2 doing ratings at both time points.
Since the ratings were of an aspect of ability with a hypothetical "correct" value, absolute agreement between raters is of interest rather than consistency. However, because the two ratings were taken 3 years apart, there might have been (and probably was) some real change in the ability itself.
- What would be the best test of reliability in this case?
- I'm leaning towards an intra-class correlation, but is ICC1 the best I can do with these data?
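For reference, the one-way random-effects ICC(1) I have in mind can be computed directly from the one-way ANOVA mean squares. A minimal sketch below, with made-up scores for illustration (the `icc1` helper and the data are hypothetical, not from my study):

```python
import numpy as np

def icc1(ratings):
    """One-way random-effects ICC(1) for an n-subjects x k-ratings matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-subject and within-subject mean squares from one-way ANOVA
    ms_between = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    # ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW)
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical data: 6 participants, each rated at two time points
scores = np.array([
    [9, 2], [1, 1], [8, 4], [2, 1], [10, 5], [3, 2],
])
print(round(icc1(scores), 3))  # → 0.305
```

Because ICC(1) treats the two ratings per participant as interchangeable, any systematic change over the 3 years ends up in the within-subject error term, which is exactly why I'm unsure it's the right choice here.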