What you describe is a reliability study in which each subject is assessed by the same three raters on two occasions. The analysis can be done separately for the two outcomes (length and weight; I assume they will be highly correlated, and that you are not interested in how this correlation is reflected in the raters' assessments). Measurement reliability can be estimated in two ways:
- The original approach (as described in Fleiss, 1987) relies on the analysis of variance components through an ANOVA table, where we assume no subject-by-rater interaction (the corresponding sum of squares is constrained to 0); of course, you won't look at $p$-values, but at the mean squares for the relevant effects (see the first sketch after this list);
- A mixed-effects model allows you to derive variance estimates directly, with time as a fixed effect and subject and/or rater as random effects. Treating rater as random depends on whether you consider your three observers to be sampled from a larger pool of potential raters or not; if the rater effect is small, the two specifications will yield very similar estimates of outcome reliability (see the second sketch below).
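To make the ANOVA route concrete, here is a minimal sketch in R, assuming a long-format data frame `d` with columns `subject`, `rater`, `occasion`, and `length` (the names and the balanced design are assumptions on my part, not something from your question):

```r
## Two-way ANOVA with no subject-by-rater interaction, as in Fleiss (1987).
## Assumed layout: one row per subject x rater x occasion, with subject
## and rater coded as factors. One occasion is used here so that each
## subject x rater cell holds a single measurement.
d1 <- subset(d, occasion == 1)

m   <- lm(length ~ subject + rater, data = d1)
tab <- anova(m)
ms_subject <- tab["subject",   "Mean Sq"]
ms_rater   <- tab["rater",     "Mean Sq"]
ms_error   <- tab["Residuals", "Mean Sq"]

n <- nlevels(d1$subject)  # number of subjects
k <- nlevels(d1$rater)    # number of raters (3 here)

## ICC for a two-way random-effects design, single rating, absolute
## agreement (ICC(2,1) in Shrout & Fleiss's notation):
icc <- (ms_subject - ms_error) /
  (ms_subject + (k - 1) * ms_error + k * (ms_rater - ms_error) / n)
icc
```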
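And a sketch of the mixed-effects route with lme4 (my choice of package here; any mixed-model software would do), contrasting the fixed-rater and random-rater specifications mentioned above:

```r
library(lme4)

## Rater as a fixed effect: your three raters are the only ones of interest.
m_fixed  <- lmer(length ~ occasion + rater + (1 | subject), data = d)

## Rater as a random effect: the three raters are viewed as sampled from
## a larger pool of potential raters.
m_random <- lmer(length ~ occasion + (1 | subject) + (1 | rater), data = d)

## Variance components and a single-rating reliability (ICC) estimate:
vc     <- as.data.frame(VarCorr(m_random))
sigma2 <- setNames(vc$vcov, vc$grp)  # subject, rater, Residual
icc    <- sigma2[["subject"]] / sum(sigma2)
icc
```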
In both cases, you will be able to derive a single intraclass correlation coefficient as a measure of the reliability of the assessments (in Generalizability Theory, such coefficients are called generalizability coefficients), which answers your second question. The first question deals with a potential effect of time (considered as a fixed effect), which I discussed in Reliability in Elicitation Exercise. More details can be found in Dunn (1989) or Brennan (2001).
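For the record, in the random-rater model with no subject-by-rater interaction, this single-rating coefficient is just the ratio of the subject variance to the total variance,
$$\hat{\rho} = \frac{\hat{\sigma}^2_{\text{subject}}}{\hat{\sigma}^2_{\text{subject}} + \hat{\sigma}^2_{\text{rater}} + \hat{\sigma}^2_{\text{error}}},$$
which is what the last lines of the second sketch above compute.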
I have an R example script on GitHub that illustrates both approaches. I think it would not be too difficult to incorporate rater effects into the model.
References
- Fleiss, J.L. (1987). The design and analysis of clinical experiments. New York: Wiley.
- Dunn, G. (1989). Design and analysis of reliability studies. Oxford University Press.
- Brennan, R.L. (2001). Generalizability theory. New York: Springer.