Consider a set-up in which we use machine learning to distinguish healthy from diseased samples. Obtaining the data requires an invasive procedure, so all the healthy samples come from the same patients as the diseased samples, just from a different location. The data are therefore paired.

I'm considering what, if any, consequences such a set-up would have for the generalisation of a model trained on such data.

If we were performing hypothesis testing, such a paired set-up would be beneficial: each patient provides their own control, which reduces variance. In the context of machine learning this might help to train a model with 'better' performance metrics, but would the model not suffer when applied outside its training range, since it has not been exposed to a representative range of samples? (Maybe this depends on the sample size, but medical datasets are typically small.)
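To make the variance argument concrete, here is a minimal sketch. The symbols are my own assumptions for illustration: $X_d$ and $X_h$ are a measurement on the diseased and healthy sample from one patient, with variances $\sigma_d^2$ and $\sigma_h^2$ and within-patient correlation $\rho$.

$$
\operatorname{Var}(X_d - X_h) = \sigma_d^2 + \sigma_h^2 - 2\rho\,\sigma_d\,\sigma_h
$$

With unpaired samples from different patients the covariance term vanishes, leaving $\sigma_d^2 + \sigma_h^2$, so the paired difference has lower variance whenever $\rho > 0$.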

If this is a concern, are there any cross-validation techniques to deal with it? For instance, you could try to split the train/test sets such that only unmatched samples are in each set.
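One possible reading of that split is patient-level grouping, where both samples from a patient always land on the same side of the train/test boundary. Below is a rough sketch of that idea using only the standard library. The function name `patient_level_folds` and the toy `patient_ids` list are hypothetical and made up for illustration, and I am not sure this is the best approach.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no patient appears on both sides.

    patient_ids[i] is the (hypothetical) patient label of sample i.
    """
    # Collect the sample indices belonging to each patient.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    patients = sorted(by_patient)
    if n_folds > len(patients):
        raise ValueError("need at least as many patients as folds")

    # Shuffle patients, not samples, so each pair stays together.
    random.Random(seed).shuffle(patients)

    for k in range(n_folds):
        # Every n_folds-th patient, starting at position k, forms the test fold.
        test_patients = set(patients[k::n_folds])
        test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
        train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
        yield train_idx, test_idx


# Toy data (assumed): six patients, one healthy and one diseased sample each.
patient_ids = ["p1", "p1", "p2", "p2", "p3", "p3",
               "p4", "p4", "p5", "p5", "p6", "p6"]

for train_idx, test_idx in patient_level_folds(patient_ids, n_folds=3):
    print("train:", train_idx, "test:", test_idx)
```

If scikit-learn is available, `GroupKFold` performs the same kind of split when the patient labels are passed as `groups`.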

N Blake
