Does 'proper' experimental design, in terms of ensuring scientific validity, require that the same skilled individual perform all of the measurements?

Question

I'm not sure if this is on-topic here. If not, then please tell me and I'll delete it.

I have a question about the proper way to design biomedical experiments that involve human-administered measurement. Let's take a hypothetical scenario where we're performing an experiment that requires taking measurements of various parts of human anatomy. Furthermore, let's assume that it is difficult to take such measurements and requires some significant degree of 'human judgement', so it requires the 'measurers' to be individuals of a certain skill/experience/training/qualification/whatever. Does 'proper' experimental design, in terms of ensuring the scientific validity of the experiment/research, require that the same skilled individual perform all of the measurements? What if the measurements were taken using two skilled individuals? What about three? Etc.

I have taken some time to think about this, and it seems to me that what I'm fundamentally thinking about here is error. I am not experienced when it comes to experimental design, but it seems to me that there are a few dimensions to this:

1. The first, and primary, dimension is whether having more than one person taking measurements is problematic. Different individuals – even skilled ones – are bound to be of differing skill levels and introduce varying amounts and types of error (I don't necessarily mean types of error in the statistical sense; it could be error from interpreting the anatomy in a slightly different way, or using the measurement device in a slightly different way, or just not having slept well the night before, etc.).

   At the end of the research, all of the data are aggregated together, with no distinction between data from one measurer vs data from another measurer. This means that we are then trying to draw conclusions from data that have this 'mishmash' of errors, rather than just having a single person do all of the measurements so that all of the data contain a consistent pattern of errors.

   It seems to me that the fact that all of the data have a consistent pattern of error means that the data are effectively 'normalised' in this way, and so drawing conclusions from them would be more valid/meaningful than in the mishmash of errors case.

2. The second dimension is the number of people/subjects participating in the experiment (that is, our sample size). This is because, even if we have multiple people taking measurements and introducing their own errors, leading to inconsistent patterns of errors in the data, I wonder if just having a larger amount of data/measurements would lead to some kind of statistical effect that makes drawing conclusions more valid/meaningful.

3. The third dimension is the relative/proportional number of measurements taken by each measurer. If we have a total of 10 subjects and two measurers, and each measurer measures 5 subjects, then it seems to me that the problem explained in 1. would be maximised, whereas, if we had one measurer measure 9 subjects and the other measurer measure 1, then we're getting close to the result of just having a single measurer.

score 0 · Accepted Answer · answered Mar 19 '21 at 19:24

A very valid question.

This is why every design of experiments (DOE) textbook talks about randomized experiments. If the experiments are performed in a random order, then the error introduced by measurer 1, 2, 3, etc. will be randomly distributed across all the experimental runs and not confounded with an experimental factor of interest.
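To make the confounding point concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: two measurers, a systematic offset of 0.8 units for the second one, a true treatment effect of 1.0, and Gaussian measurement noise. The function name `simulate` and its parameters are hypothetical.

```python
import random
import statistics


def simulate(assignment, n_per_group=200, true_effect=1.0,
             measurer_bias=(0.0, 0.8), seed=0):
    """Return the estimated treatment effect (treated mean minus control mean).

    assignment: "confounded" -> measurer 0 measures every control subject and
                                measurer 1 measures every treated subject;
                "randomized" -> each subject gets a randomly chosen measurer.
    All numeric defaults are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    control, treated = [], []
    for group, values in ((0, control), (1, treated)):
        for _ in range(n_per_group):
            if assignment == "confounded":
                measurer = group              # measurer is tied to the group
            else:
                measurer = rng.randrange(2)   # measurer chosen at random
            true_value = 10.0 + true_effect * group
            noise = rng.gauss(0.0, 0.5)
            values.append(true_value + measurer_bias[measurer] + noise)
    return statistics.mean(treated) - statistics.mean(control)


confounded = simulate("confounded")
randomized = simulate("randomized")
print(f"confounded estimate: {confounded:.2f}")  # absorbs the measurer offset
print(f"randomized estimate: {randomized:.2f}")  # close to the true effect
```

In the confounded case the measurer offset is added in full to the estimated effect, and no amount of extra data removes it. With random assignment the offset shows up as extra noise in both groups instead of as bias.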

Depending on the size of the experiment and the number of measurers, there are a few ways to handle this.

If there is a small number of samples and a pair of measurers, then the measurer could become a factor and be added to the experimental design.
If there are a few measurers, they could become a blocking factor in the design.
If there are a large number of samples and more measurers, randomly assign measurers across the experimental conditions; the measurer error is then mixed into the repeatability/reproducibility term.
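As a rough illustration of the blocking option, the sketch below compares a pooled estimate with one computed within each measurer and then averaged. The measurer labels `A` and `B`, the offsets, the allocation counts, and the true effect are all hypothetical values chosen for the example; a real analysis would more likely fit a model with measurer as a block or random effect.

```python
import random
import statistics

rng = random.Random(1)

TRUE_EFFECT = 1.0
BIAS = {"A": 0.0, "B": 0.8}  # hypothetical systematic offset per measurer
# Deliberately unbalanced allocation: (measurer, group) -> number of subjects,
# where group 0 is control and group 1 is treated.
COUNTS = {("A", 0): 150, ("A", 1): 50, ("B", 0): 50, ("B", 1): 150}

# Simulate one measurement per subject as (measurer, group, value).
records = []
for (measurer, group), n in COUNTS.items():
    for _ in range(n):
        value = 10.0 + TRUE_EFFECT * group + BIAS[measurer] + rng.gauss(0.0, 0.5)
        records.append((measurer, group, value))


def group_mean(rows, group):
    """Mean measured value for one treatment group within the given rows."""
    return statistics.mean(v for _, g, v in rows if g == group)


# Pooled estimate: ignores who took each measurement.
pooled = group_mean(records, 1) - group_mean(records, 0)

# Blocked estimate: treated-minus-control within each measurer, then averaged.
within = []
for measurer in BIAS:
    rows = [r for r in records if r[0] == measurer]
    within.append(group_mean(rows, 1) - group_mean(rows, 0))
blocked = statistics.mean(within)

print(f"pooled estimate:  {pooled:.2f}")
print(f"blocked estimate: {blocked:.2f}")
```

Because each within-measurer comparison involves a single measurer, that measurer's offset cancels out of the difference. The pooled estimate does not have this property when the measurers see different mixes of treated and control subjects.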

Maybe you could also mention [tag:agreement-statistics] and the possibility to actually *measure* the degree of agreement, and if not good enough, maybe look into extra training of the measurers. — kjetil b halvorsen, Mar 20 '21 at 00:30
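One simple way to quantify agreement between two measurers is a Bland-Altman style summary: have both measure the same subjects, then look at the mean and spread of the paired differences. The sketch below uses simulated data, and every number in it (100 subjects, an offset of 0.8, a noise standard deviation of 0.5) is an assumption for illustration only.

```python
import random
import statistics

rng = random.Random(2)

# Simulated "true" anatomical values for 100 subjects (hypothetical units).
true_values = [rng.gauss(50.0, 5.0) for _ in range(100)]

# Both measurers measure every subject; measurer 2 is assumed to read 0.8 higher.
measurer_1 = [t + rng.gauss(0.0, 0.5) for t in true_values]
measurer_2 = [t + 0.8 + rng.gauss(0.0, 0.5) for t in true_values]

# Paired differences, one per subject.
diffs = [b - a for a, b in zip(measurer_1, measurer_2)]

bias = statistics.mean(diffs)       # systematic difference between measurers
sd_diff = statistics.stdev(diffs)   # spread of the disagreement
# Approximate 95% limits of agreement, assuming roughly normal differences.
limits = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)

print(f"mean difference (bias): {bias:.2f}")
print(f"95% limits of agreement: {limits[0]:.2f} to {limits[1]:.2f}")
```

If the limits are wider than what is acceptable for the study, that is a sign the measurers need further training or a tighter protocol before the real data are collected.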
