I have data from a biological (fMRI) experiment that was previously analyzed with a different model in a machine-learning fashion: a training and a validation set were used in a cross-validation routine to find regularization parameters, and a test set was used for the final evaluation. The data are used to predict the activity of single voxels over time from a set of stimulus features at each time point.
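To make the setup concrete, here is a minimal sketch of how I understand the pipeline. The array shapes, names, and the use of scikit-learn's `RidgeCV` are my own placeholders for illustration, not the original authors' code:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Placeholder data: stimulus features (timepoints x features), voxel responses (timepoints x voxels)
X_train, Y_train = rng.standard_normal((3600, 50)), rng.standard_normal((3600, 100))
X_test,  Y_test  = rng.standard_normal((270, 50)),  rng.standard_normal((270, 100))

# Regularization strength is chosen by cross-validation on the training data only
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
model.fit(X_train, Y_train)

# Final evaluation: per-voxel correlation between predicted and measured responses on the test set
Y_pred = model.predict(X_test)
test_r = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                   for v in range(Y_test.shape[1])])
```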
The experimenters made sure to acquire a relatively long training data set in order to end up with a good regularized regression model. However, they also decided to record the much shorter test data set multiple times (at several points during the experiment) and to evaluate against the average of all runs. I can understand this decision, since BOLD data from fMRI experiments are very prone to noise, e.g. from subjects not paying attention, adapting too much, or getting drowsy over the course of the experiment; a single test-set recording can therefore easily be corrupted by random influences.
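The averaging step, as I understand it, looks roughly like this (again with placeholder arrays; the same short test stimulus is assumed to have been recorded `n_runs` times):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder: the same short test stimulus recorded n_runs times,
# shape (n_runs, n_timepoints, n_voxels)
test_runs = rng.standard_normal((10, 270, 100))

# Evaluation is done against the run-averaged responses rather than any single run,
# so run-specific noise (attention lapses, drowsiness, drift) is suppressed
# while the stimulus-locked signal is preserved.
Y_test_avg = test_runs.mean(axis=0)
```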
My question: To what extent can this decision (recording the test data set multiple times and evaluating on its average) be criticized from a machine learning perspective? I would like to stick to their routine for comparability, but I keep stumbling over this question.
What I notice during my own model training is that performance on the (averaged) test set is higher than on the training set, which is no big surprise to me since much of the noise is averaged out. This happens fairly consistently; however, there are cases where the model works only on the training set and not on the test set at all (i.e. I think it is still possible to spot overfitting).
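For reference, this is roughly how I compare training and test performance per voxel; the helper below is my own sketch, and the commented usage assumes the placeholder names from the snippets above:

```python
import numpy as np

def voxelwise_r(Y_true, Y_pred):
    """Pearson correlation between measured and predicted time courses, per voxel."""
    return np.array([np.corrcoef(Y_true[:, v], Y_pred[:, v])[0, 1]
                     for v in range(Y_true.shape[1])])

# train_r = voxelwise_r(Y_train, model.predict(X_train))
# test_r  = voxelwise_r(Y_test_avg, model.predict(X_test))
# Voxels with high train_r but test_r near zero are the cases where the model
# fits training noise rather than stimulus-driven signal, i.e. overfitting is still visible.
```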