I have data from a biological (fMRI) experiment that was previously analyzed with a different model in a machine-learning fashion: a training and a validation set were used in a cross-validation routine to find regularization parameters, and a test set was used for the final evaluation. The data are used to predict the activity of single voxels over time from a set of stimulus features at each time point.
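To make the setup concrete, here is a minimal sketch of how I understand the pipeline. The array shapes, names, and the use of scikit-learn's `RidgeCV` are my own placeholders for illustration, not the original authors' code:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Placeholder data: stimulus features (timepoints x features), voxel responses (timepoints x voxels)
X_train, Y_train = rng.standard_normal((3600, 50)), rng.standard_normal((3600, 100))
X_test,  Y_test  = rng.standard_normal((270, 50)),  rng.standard_normal((270, 100))

# Regularization strength is chosen by cross-validation on the training data only
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
model.fit(X_train, Y_train)

# Final evaluation: per-voxel correlation between predicted and measured responses on the test set
Y_pred = model.predict(X_test)
test_r = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                   for v in range(Y_test.shape[1])])
```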
The experimenters made sure to acquire a relatively long training data set in order to end up with a good regularized regression model. However, they also decided to record the much shorter test data set multiple times (at several points during the experiment) and to evaluate against the average of all runs. I can understand this decision, since BOLD data from fMRI experiments are very prone to noise, e.g. from subjects not paying attention, adapting too much, or getting drowsy over the course of the experiment; a single test-set recording can therefore easily be corrupted by random influences.
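The averaging step, as I understand it, looks roughly like this (again with placeholder arrays; the same short test stimulus is assumed to have been recorded `n_runs` times):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder: the same short test stimulus recorded n_runs times,
# shape (n_runs, n_timepoints, n_voxels)
test_runs = rng.standard_normal((10, 270, 100))

# Evaluation is done against the run-averaged responses rather than any single run,
# so run-specific noise (attention lapses, drowsiness, drift) is suppressed
# while the stimulus-locked signal is preserved.
Y_test_avg = test_runs.mean(axis=0)
```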
My question: To what extent can this decision (recording the test data set multiple times and evaluating on its average) be criticized from a machine learning perspective? I would like to stick to their routine for comparability, but I keep stumbling over this question.
What I notice during my own model training is that performance on the (averaged) test set is higher than on the training set, which is no big surprise to me since much of the noise is averaged out. This happens fairly consistently; however, there are cases where the model works only on the training set and not on the test set at all (i.e. I think it is still possible to spot overfitting).
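For reference, this is roughly how I compare training and test performance per voxel; the helper below is my own sketch, and the commented usage assumes the placeholder names from the snippets above:

```python
import numpy as np

def voxelwise_r(Y_true, Y_pred):
    """Pearson correlation between measured and predicted time courses, per voxel."""
    return np.array([np.corrcoef(Y_true[:, v], Y_pred[:, v])[0, 1]
                     for v in range(Y_true.shape[1])])

# train_r = voxelwise_r(Y_train, model.predict(X_train))
# test_r  = voxelwise_r(Y_test_avg, model.predict(X_test))
# Voxels with high train_r but test_r near zero are the cases where the model
# fits training noise rather than stimulus-driven signal, i.e. overfitting is still visible.
```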