"the goal is to build a predictive model."

I read this as: a model that is then actually used for prediction, and we need to know the performance of exactly that model.*
Independent Test Set or Hold Out Testing
Now suppose your setup is the following: a training set is used to build your model, and once that model is finalized, its performance is evaluated with a properly independent test set.
In that case, as the model is fixed, we need to account only for the variance uncertainty due to the limited number of tested cases - as always, if the performance estimate is based on measuring more cases, the uncertainty will be lower.
Thus, bootstrap your figure of merit from the test results.
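As a concrete illustration, here is a minimal sketch (Python/numpy; the data, the 0.85 error-free fraction, and the choice of accuracy as figure of merit are placeholders of mine, and the percentile interval is only one of several ways to turn the bootstrap distribution into an interval):

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder hold-out test results of the *fixed* final model:
# y_true are the reference labels, y_pred the model's predictions.
y_true = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)

n_test = len(y_true)
n_boot = 2000

# Bootstrap the figure of merit (here: accuracy) by resampling test cases with replacement.
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n_test, size=n_test)
    boot_acc[b] = np.mean(y_true[idx] == y_pred[idx])

print("accuracy:", np.mean(y_true == y_pred))
print("95 % percentile interval:", np.percentile(boot_acc, [2.5, 97.5]))
```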
Figures of merit that are proportions (0/1 loss, e.g. accuracy, precision, recall, sensitivity, ...) follow a binomial distribution, so you can also directly calculate confidence intervals from the binomial distribution. This is particularly useful because you can do that beforehand as a back-of-the-envelope calculation to check whether your experiment can possibly yield a sufficiently narrow confidence interval for your figure of merit to be of practical use.
We've outlined such approaches in: Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
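For such a back-of-the-envelope check, a minimal sketch (the expected sensitivity of 0.90 and the candidate sample sizes are assumptions for illustration; `proportion_confint` from statsmodels provides the standard binomial interval methods):

```python
from statsmodels.stats.proportion import proportion_confint

# How wide would a 95 % confidence interval for a proportion-type figure of merit
# (e.g. sensitivity) be for various test sample sizes, if the true value is about 0.90?
expected_sensitivity = 0.90  # assumed value for planning purposes

for n_test in (25, 50, 100, 250, 500):
    successes = round(expected_sensitivity * n_test)
    low, high = proportion_confint(successes, n_test, alpha=0.05, method="wilson")
    print(f"n = {n_test:4d}: 95 % Wilson CI [{low:.3f}, {high:.3f}], width {high - low:.3f}")
```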
Resampling validation: Cross Validation, Set Validation, Out-Of-Bootstrap & Co.
Resampling validation takes so-called surrogate models trained on a subset of the data at hand and tests them with the respective cases not used for that surrogate model's training. This is typically done for many surrogate models, and the test results are pooled and used as approximation for the performance of the final model which is trained with the same algorithm but on the whole data set.
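A minimal sketch of this setup (scikit-learn, with a placeholder data set and estimator; the particular modeling algorithm does not matter here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

# Each fold yields one surrogate model, trained on a subset of the data
# and tested on the cases left out of that subset.
pooled_true, pooled_pred = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pooled_true.append(y[test_idx])
    pooled_pred.append(surrogate.predict(X[test_idx]))

# The pooled test results approximate the performance of the final model,
# which is trained with the same algorithm on the whole data set.
pooled_true = np.concatenate(pooled_true)
pooled_pred = np.concatenate(pooled_pred)
print("pooled CV accuracy:", np.mean(pooled_true == pooled_pred))

final_model = LogisticRegression(max_iter=1000).fit(X, y)  # the model actually used later
```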
In this case, the situation is more complex as we have to take into account:
Bias: as the surrogate models are trained on smaller subsets, they are usually a bit worse than the final model; this is the root cause of the slight pessimistic bias of resampling validation. Your confidence interval will be shifted by this bias.
$k$-fold CV with not too small $k$ usually has low bias, while the bias of out-of-bootstrap can be more substantial, and I've seen the .632 bootstrap show an optimistic bias.
However, depending on the application question behind this, the bias may not be too bad: I've done a lot of work developing models for clinical diagnostic questions. In that case, I use cross validation (low but pessimistic bias) and can then say that my confidence interval will be a bit too conservative - which in this case is far more acceptable than a possibly overoptimistic estimate.
With some experience, you may be able to get an idea of the order of magnitude for your data.
We've studied this for small $n$, large $p$ situations that are typical in my field: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91-100 (2005).
Variance uncertainty due to the limited number of tested cases: this is a bit more tricky now than above. As the results for all cases are pooled, you'd bootstrap test results across all cases; but those test results come from multiple surrogate models,
and there is also variance uncertainty due to possible model (in)stability, which is caused by the limited number of training cases and possibly by non-determinism in the training algorithm ("variance source 2b").
As this is important information in its own right, you may want to directly measure it with repeated/iterated cross validation or bootstrap-based resampling, see e.g. Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 2008, 390, 1261-1271.
DOI: 10.1007/s00216-007-1818-6
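One way to measure this directly is repeated/iterated cross validation; sketched below with scikit-learn's `RepeatedStratifiedKFold` on placeholder data. Each repetition tests every case exactly once, so the scatter between the repetition-wise results cannot come from the test cases and instead reflects model (in)stability:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=150, n_features=20, random_state=1)

n_splits = 5
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=20, random_state=1)

# Collect one accuracy per repetition (each repetition = n_splits consecutive folds).
acc_per_repetition, fold_accs = [], []
for fold_no, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_accs.append(np.mean(surrogate.predict(X[test_idx]) == y[test_idx]))
    if fold_no % n_splits == 0:
        acc_per_repetition.append(np.mean(fold_accs))
        fold_accs = []

acc_per_repetition = np.asarray(acc_per_repetition)
# The same cases are tested in every repetition, only the surrogate models differ,
# so the spread between repetitions measures instability (variance source 2b).
print("mean accuracy:", acc_per_repetition.mean())
print("std. dev. between repetitions (instability):", acc_per_repetition.std(ddof=1))
```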
With such a repeated cross validation or out-of-bootstrap or any of its variants (.632, .632+), your raw test results include both relevant sources of variance. But what we want is the distribution of the figure of merit that pools both sources: $n_t$ tested independent cases and $n_b$ surrogate models.
While I've not quite finished thinking this through, at the moment I bootstrap both $n_b$ out of $n_b$ surrogate models and $n_t$ out of $n_t$ test cases to construct my distribution for the figure of merit.
(I presented a poster "C. Beleites & A. Krähmer: Cross-Validation Revisited: using Uncertainty Estimates to Improve Model Autotuning" about this a few weeks ago; please do not hesitate to email me [see profile] if you'd like a copy.)
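A minimal sketch of that double bootstrap, assuming the repeated test results are stored in a matrix `correct[i, j]` that records whether case $j$ was classified correctly in run/surrogate $i$ (the names, the matrix layout, and the placeholder numbers are mine, not taken from the poster):

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder results: correct[i, j] = 1 if test case j was predicted correctly
# by surrogate model / repetition i (n_b runs, n_t cases).
n_b, n_t = 20, 150
correct = (rng.random((n_b, n_t)) < 0.85).astype(float)

n_boot = 2000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    rows = rng.integers(0, n_b, size=n_b)   # resample n_b out of n_b surrogate models
    cols = rng.integers(0, n_t, size=n_t)   # resample n_t out of n_t test cases
    boot_acc[b] = correct[np.ix_(rows, cols)].mean()

print("accuracy:", correct.mean())
print("95 % interval covering both variance sources:", np.percentile(boot_acc, [2.5, 97.5]))
```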
* As opposed to: a model trained with this training algorithm on a data set of size $n$ drawn from this general population (but not on this particular data set). In that case, a proper estimate of the variance needs multiple data sets; resampling validation cannot estimate it, see
Bengio, Y. and Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105.