I have a dataset that comes with 10 predefined folds. The authors of the paper created these 10 folds, and each fold consists of a training set $D_{tr}$ and a test set $D_{te}$ (obviously, for each fold, $D_{tr} \cup D_{te} = D$, the full dataset).
I have built a nested-CV procedure that does the following: it takes the $D_{tr}$ of a fold, evaluates a set of models (searching for the best hyper-parameters) using 5-fold CV, and returns the best set of hyper-parameters ($P^*$). Next, I use this parameter set $P^*$ to construct a model $m^*$, train it on $D_{tr}$, and evaluate the trained model on $D_{te}$. After repeating this for all 10 folds, I have 10 different models and 10 different test estimates (i.e., a model and a score for each fold). I average the scores of the 10 folds and note the result down.
Procedurally, it looks like this:
$\textbf{for}$ fold $i = 1$ to $10$:
$\quad\textbf{for}$ each hyper-parameter setting $P_j$, $j \in [N_p]$:
$\quad\quad$evaluate the performance of model $m$ built with $P_j$ using 5-fold CV on $D_{tr}$; call this $iscore_j$
$\quad best = \arg\min_{j \in [N_p]} iscore_j$
$\quad P^* = P_{best}$
$\quad$build model $m$ using $P^*$ and $D_{tr}$ (call this $m^*$)
$\quad$evaluate the performance of $m^*$ on $D_{te}$ (call this $score_i$)
Report $\frac{1}{10}\sum_{i=1}^{10}{score_i}$
Note: The term 'build' means training.
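For concreteness, here is a minimal sketch of this procedure in scikit-learn. The dataset, the estimator (an SVC), the parameter grid, and the generated outer folds are all placeholders of mine; in practice you would load the authors' fixed fold indices instead. Also note that `GridSearchCV` maximizes its scoring function, whereas the pseudocode minimizes a loss; the two are equivalent up to a sign.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for D

# Stand-in for the authors' fixed 10 folds; in practice, load their
# (train_idx, test_idx) index pairs instead of generating them here.
outer_folds = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}  # example P_j's

scores = []
for train_idx, test_idx in outer_folds:       # outer loop over the 10 folds
    X_tr, y_tr = X[train_idx], y[train_idx]   # D_tr
    X_te, y_te = X[test_idx], y[test_idx]     # D_te

    # Inner 5-fold CV on D_tr to pick P*; refit=True (the default)
    # then retrains m* with P* on all of D_tr -- the "build" step above.
    search = GridSearchCV(SVC(), param_grid, cv=5,
                          scoring="accuracy", n_jobs=-1)
    search.fit(X_tr, y_tr)

    scores.append(search.best_estimator_.score(X_te, y_te))  # score_i

print(np.mean(scores))  # the reported average over the 10 folds
```

Setting `n_jobs=-1` parallelizes the inner grid search across cores, which is the easiest lever against the computational cost discussed below.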
You can see that:
- This way, the test set of each fold is never touched during hyper-parameter selection, so I get an unbiased estimate.
- Each fold gives a model and a score.
- But I am incurring a huge cost in computational resources and time (a concrete count follows this list).
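To make the last point concrete: each outer fold performs $5 N_p$ inner fits (5-fold CV over $N_p$ candidate settings) plus one final fit of $m^*$, so the total number of model fits is
$$10 \times (5 N_p + 1),$$
e.g. 5010 fits for $N_p = 100$ candidate settings.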
Is this the way things should be done, or am I missing something that could make it less expensive?
Most related question: the procedure I have written above is what is described in the accepted answer to this question. It seems this is the standard nested-CV procedure, which makes me both happy and sad.