I'm very familiar with the standard test/train approach to forecasting, i.e. https://otexts.com/fpp2/accuracy.html
This question concerns day forward-chaining nested cross-validation, see https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9, which has been proposed to evaluate forecast model predictions when model hyperparameters need tuning.
It works by first temporally splitting a time series into a training set and a test set. The test set is further partitioned into sub-test sets, while maintaining temporal order. This first split is simply the standard approach.
Where "day forward-chaining nested cross-validation" differs from the standard approach is that the training set is also partitioned into a sub-training set and a validation set (again maintaining temporal order, with the sub-training set taking the earlier partition).
An algorithm is trained on the sub-training set, and its hyperparameters are tuned via the validation set. Once optimal hyperparameters have been found, the algorithm is re-trained on the full training set (sub-training + validation sets). Model performance is then evaluated against the first sub-partition of the test set.
The first sub-partition of the test set is then absorbed into the training set, and the process repeats. The second model is now tested against what was originally the second sub-partition of the test set.
This process repeats until there are no more test partitions left.
Some metric (RMSE, for example) is used to evaluate each model's performance against its sub-partition test set, so for each sub-partition there is a corresponding evaluation value. These values are then averaged to give an overall model performance.
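To check my understanding, here is a rough end-to-end sketch of the whole procedure. Simple exponential smoothing with a single smoothing parameter is only a placeholder forecaster, and the hyperparameter grid, split sizes and fold count are arbitrary choices:

```python
# A self-contained sketch of day forward-chaining nested cross-validation,
# with simple exponential smoothing standing in for any tunable forecaster.
import numpy as np

def ses_forecast(history, alpha, horizon):
    """One-parameter simple exponential smoothing; flat forecast over the horizon."""
    level = history[0]
    for obs in history[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(horizon, level)

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - predicted) ** 2))

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))       # toy series

n_train = 140
train, test = y[:n_train], y[n_train:]
test_folds = np.array_split(test, 4)      # sub-test partitions, in temporal order

alphas = np.linspace(0.1, 0.9, 9)         # hyperparameter grid
fold_scores, chosen_alphas = [], []

for fold in test_folds:
    # Inner split: earlier 80% of the current training set vs. later 20%.
    n_sub = int(len(train) * 0.8)
    sub_train, valid = train[:n_sub], train[n_sub:]

    # Tune the hyperparameter on the validation set.
    val_scores = [rmse(valid, ses_forecast(sub_train, a, len(valid))) for a in alphas]
    best_alpha = alphas[int(np.argmin(val_scores))]
    chosen_alphas.append(best_alpha)

    # Retrain on the full training set and evaluate on the next sub-test partition.
    fold_scores.append(rmse(fold, ses_forecast(train, best_alpha, len(fold))))

    # Absorb the evaluated partition into the training set and move forward.
    train = np.concatenate([train, fold])

print("per-fold RMSE:", np.round(fold_scores, 3))
print("chosen alpha per fold:", chosen_alphas)
print("overall (averaged) RMSE:", np.mean(fold_scores))
```

The "overall model performance" I'm asking about is that final averaged RMSE, even though each fold may have settled on a different alpha.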
First of all, have I understood the approach correctly? My issue with this approach is how to interpret the overall model performance value. I don't have any specific problem in mind; I'm just thinking aloud. One is averaging over essentially different models, i.e. each fold's model may end up with a unique set of hyperparameter settings. So is it really fair to call this "the model's" performance? Perhaps I'm getting hung up on the word "model", but one is essentially averaging the scores of different models. I suppose that if the folds end up with wildly different hyperparameters, that is a sign that you should perhaps look at other algorithms better suited to the characteristics of the time series, so maybe this is a non-issue. Does one by default expect little change in the hyperparameter values across folds?
And finally, in production, does this not mean that one needs to periodically retrain the algorithm via day forward-chaining nested cross-validation, ideally at a period equal to the length of the original sub-partition test sets? And presumably continuously re-evaluate it once future data becomes available?