IMHO, set validation vs. cross validation of a model is mostly a matter of choice (and it doesn't really matter whether that model internally uses further splits during training).
IMHO, the main pitfall is: does your splitting procedure actually achieve statistical independence between the sets? But this is the same for set and cross validation.
The same reliability of results can be achieved with set or cross validation. In both cases you need to take more care than is customary - but in slightly different respects.
The underlying difficulty is that we have more than one source of uncertainty.
This has consequences wrt. your motivations:
- 1) the overall uncertainty is dominated by the largest source of uncertainty; reducing the others cannot bring the total uncertainty down to any practical extent.
- 3) hypothesis tests should be constructed in a way that takes care of these multiple sources of error (uncertainty).
For models that are to be used for production, i.e. where we want to estimate the generalization error of the model trained from the sample at hand, both set and cross validation have two sources of random error: the finite number of independent cases that are tested and model instability.
If instead you need the generalization error of a model trained on a sample of size $n$ drawn from the same population as the sample at hand, there is additional uncertainty which cannot be estimated by set or cross validation. If your "comparing sources of data" relates to this, you'll either have to live with an unknown source of uncertainty that hampers your ability to draw conclusions, or you need to take radically different approaches.
I'll continue with the "doable" task of estimating generalization error for a model to be used in production (which happens to be the main "mode" of my model validation work).
We can estimate both model instability and uncertainty due to the finite number of tested cases from set or from cross validation.
To get more reliable, i.e. less uncertain results wrt. test sample size error, we need to test as many cases as possible.
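For instance, if the figure of merit is a misclassification rate, the fraction $\hat p$ observed on $n_\text{test}$ independent test cases has (binomial) variance $\frac{p(1-p)}{n_\text{test}}$, so halving this uncertainty requires roughly four times as many tested cases.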
- Cross validation has the advantage of guaranteeing by construction that after $k$ folds each case has been tested exactly once.
- When randomly reserving a fraction of $\frac{1}{k}$ of the cases for testing in a set validation, you'll need more than $k$ iterations to have each available case tested at least once (unless you accidentally hit a series of splits that is equivalent to a run of $k$-fold cross validation). The short simulation sketch after this list illustrates the difference.
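A minimal simulation sketch of this coverage difference (sample size, $k$, and the number of simulated campaigns are made-up numbers, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, k = 80, 8     # hypothetical: 80 cases, test fraction 1/k
n_sim = 1000           # number of simulated set-validation campaigns

iterations_needed = []
for _ in range(n_sim):
    untested = np.ones(n_cases, dtype=bool)
    iterations = 0
    while untested.any():
        # one set-validation split: randomly reserve n_cases/k cases for testing
        test_idx = rng.choice(n_cases, size=n_cases // k, replace=False)
        untested[test_idx] = False
        iterations += 1
    iterations_needed.append(iterations)

print(f"{k}-fold CV tests every case after exactly {k} folds;")
print(f"random 1/{k} hold-out splits needed {np.mean(iterations_needed):.1f} "
      f"iterations on average (max {max(iterations_needed)}) to cover every case")
```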
Model instability is a bit different. Usually we want the models to be stable.
Again, we can improve our generalization error estimate by obtaining estimates from as many different surrogate models as possible. However, in practice we often only want to establish that model instability is not the dominating contribution to the overall uncertainty. This often doesn't need that many surrogate models and thus iterations or repetitions.
For cross validation, you should repeat, i.e. do several runs of $k$-fold CV. This is called repeated (aka iterated) $k$-fold cross validation and is sometimes described as $i \times k$-fold CV.
I often start with a small number (say, 3) of repetitions. After that, I have 3 estimates from different surrogate models for each of my cases. While an estimate of the variance due to model instability will still be very uncertain with just 3 repetitions, it is often sufficient to either establish that model instability is not a problem here (in which case further repetitions would not help much), or that the models are so unstable that I need to go back and change the training procedure anyway.
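As a sketch of what this looks like in code (model, data set, and numbers are placeholder assumptions; scikit-learn's `RepeatedKFold` is one way to set up $i \times k$-fold CV):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
n_repeats, k = 3, 8
cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)

# predictions[r, i] = prediction for case i by the surrogate model that did not
# see case i during training in repetition r
predictions = np.full((n_repeats, len(y)), np.nan)
for split_no, (train, test) in enumerate(cv.split(X)):
    repetition = split_no // k          # folds are grouped by repetition
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    predictions[repetition, test] = model.predict(X[test])

# every case now has n_repeats predictions from different surrogate models;
# disagreement between repetitions is a symptom of model instability
unstable = (predictions != predictions[0]).any(axis=0)
print(f"cases with unstable predictions: {unstable.mean():.1%}")
```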
In the intermediate case and in case one decides to stabilize by model aggregation, one can also decide to go on and do more repetitions.
For set validation, the tested cases will start to overlap between repetitions almost from the beginning. But again, you need to monitor that you have sufficiently many surrogate models to get both estimates with an (un)certainty that allows you to draw meaningful conclusions.
In the end, set validation may be a bit easier for fine-tuning the number of surrogate models (e.g. you can do exactly 50 repetitions with 1/8 of the data split off, whereas with 8-fold CV you have to decide between 6×8-fold, i.e. 48 surrogate models, and 7×8-fold = 56 surrogate models).
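A minimal sketch of how both resampling plans could be set up with scikit-learn (the splitter names are sklearn's; the sample size and split counts are placeholders):

```python
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

n_cases = 160  # hypothetical sample size
X_dummy = [[0]] * n_cases

# set validation: any number of surrogate models, e.g. exactly 50,
# each time splitting off 1/8 of the data for testing
set_val = ShuffleSplit(n_splits=50, test_size=1 / 8, random_state=0)

# repeated 8-fold CV: the number of surrogate models is a multiple of 8
cv_48 = RepeatedKFold(n_splits=8, n_repeats=6, random_state=0)  # 48 surrogate models
cv_56 = RepeatedKFold(n_splits=8, n_repeats=7, random_state=0)  # 56 surrogate models

print(set_val.get_n_splits(X_dummy),
      cv_48.get_n_splits(X_dummy),
      cv_56.get_n_splits(X_dummy))
# -> 50 48 56
```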