Standard practice is to split the data into a train/test split, then use the training set for hyperparameter tuning / model selection, for example with cross-validation over the whole training set. Finally, the selected, fixed model is evaluated once on the hold-out test set.
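Concretely, something like this minimal sketch (using scikit-learn just for illustration; the SVC estimator and the `C` grid are arbitrary placeholders):

```python
# Sketch of the standard workflow: one hold-out split, tuning via CV on the
# training set only, a single final evaluation on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Single hold-out split; the test set is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter tuning / model selection via cross-validation on the training set.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The fixed, refitted best model is evaluated once on the hold-out test set.
print("hold-out accuracy:", search.score(X_test, y_test))
```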
With small datasets this is a serious limitation: a test set that is too small gives performance estimates with high variance, while a training set that is too small is not enough to train a decent model.
To address this, one can repeat the whole process: split the data again into a different train/test split, re-run the tuning / model selection on the new training set, and evaluate on the new test set. Repeat until all of the data has been used as test data (with a possibly different model each time), then average the performance over the splits. Something like the sketch below.
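This is essentially what I understand as nested cross-validation; a rough sketch of what I mean (again scikit-learn for illustration, with the same placeholder estimator and grid):

```python
# Sketch of the repeated / nested version: every sample ends up in an outer
# test fold exactly once, with tuning re-done inside each outer training fold,
# and the outer test scores are averaged.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tuning / selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# Each outer split tunes (and so may select) a different model on its training
# fold, then scores it on its own held-out fold.
tuned = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
outer_scores = cross_val_score(tuned, X, y, cv=outer_cv)

print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```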
Is this methodology correct / unbiased, or are there better alternatives?