You can (and should) bootstrap the train/test split (aka the "outer" split).
As for "the test set should not be touched by any means": the point is that the test data needs to be independent of all data that entered the training process in any way. Hyperparameter tuning (selecting based on intermediate evaluation results on the dev set) is part of the training process.
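For concreteness, here is a minimal sketch of what this means in practice (scikit-learn, with made-up toy data and an arbitrary model/parameter grid): the hyperparameter search runs entirely inside the training data, and the test set is evaluated exactly once at the very end.

```
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# toy data (made up for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# GridSearchCV does the "inner" (dev) splitting internally, so the choice of C
# is part of the training process and never sees X_test.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# the test set is evaluated exactly once, at the very end
print("test accuracy:", search.score(X_test, y_test))
```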
Not touching (looking into) the test set prevents one important way of violating independence: the test set entering the training process by driving decisions about the training.
However, there are other ways in which independence may be violated (e.g., by structure in the data), and sometimes setting a test set aside once [possibly even in a physical fashion] can be a good way to ensure independence with respect to such factors.
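One illustration of such structure (the grouping and all names below are made up): if there are several samples per patient, a plain random split scatters one patient's samples over both sides. A group-aware split, e.g. scikit-learn's GroupShuffleSplit, keeps each group entirely on one side.

```
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = rng.randint(0, 2, size=120)
groups = np.repeat(np.arange(30), 4)   # 30 hypothetical "patients", 4 samples each

# keep all samples of a group on the same side of the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# no group ends up in both the train and the test set
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```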
That said, if you obtain the test set by (randomly) splitting the initially available data set, there is nothing that should keep you from repeating this process to assess the stability of the whole procedure: the single test set isn't any more independent of the training data than the repeated test sets are.
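A sketch of this repeated splitting (again scikit-learn with toy data; the model choice is arbitrary): the same procedure is run over several random splits, and the spread of the test scores indicates how stable it is.

```
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 20 repetitions of the usual random train/test split (without replacement)
scores = []
for train_idx, test_idx in ShuffleSplit(n_splits=20, test_size=0.25, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# the spread of the scores tells you how stable the whole procedure is
print("accuracy: mean %.3f, sd %.3f" % (np.mean(scores), np.std(scores)))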
If you use cross validation for this (outer) split as well, you get what is called nested cross validation, with both the outer and the inner split done via cross validation.
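As a sketch of what nested cross validation looks like in code (scikit-learn, toy data, placeholder parameter grid): the inner cross validation inside GridSearchCV does the tuning, while the outer cross validation estimates the performance of the whole tuning-plus-training procedure.

```
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# inner split: hyperparameter tuning (part of the training process)
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# outer split: estimates performance of the whole tuning + training procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```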
Bootstrapping, or rather out-of-bootstrap testing, for the outer split uses drawing with replacement for the train set and is thus not simply the repeated version of the usual train/test split, which uses drawing without replacement. There is nothing wrong with using this approach, but it should be clearly stated that there is this difference in the sampling procedure, which does affect the model: a bootstrap training set contains duplicate cases and, on average, covers only about 63 % of the distinct cases.
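To make the difference concrete, here is a sketch of out-of-bootstrap testing (toy data again; the number of iterations is arbitrary): the training indices are drawn with replacement, and the cases that were never drawn form the test set for that iteration.

```
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
n = len(X)

scores = []
for _ in range(50):
    train_idx = rng.randint(0, n, size=n)             # drawing WITH replacement
    oob_idx = np.setdiff1d(np.arange(n), train_idx)   # cases never drawn -> test set
    # on average the bootstrap train set covers only ~63 % of the distinct cases
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[oob_idx], y[oob_idx]))

print("out-of-bootstrap accuracy: %.3f" % np.mean(scores))
```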