I'm very familiar with the standard test/train approach to forecasting, i.e. https://otexts.com/fpp2/accuracy.html
This question concerns day forward-chaining nested cross-validation, see https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9, which has been proposed to evaluate forecast model predictions when model hyperparameters need tuning.
It works by first temporally splitting a time series into a training set and a test set. The test set is further partitioned into sub-test sets, while maintaining temporal order. This first split is simply the standard approach.
Where "day forward-chaining nested cross-validation" differs from the standard approach is that the training set is also partitioned into a sub-training set and a validation set (again maintaining temporal order, with the sub-training set taking the earlier partition).
An algorithm is trained on the sub-training set, and its hyperparameters are tuned via the validation set. Once optimal hyperparameters have been found, the algorithm is re-trained on the full training set (sub-training + validation sets). Model performance is then evaluated against the first sub-partition of the test set.
The first sub-partition of the test set is then absorbed into the training set, and the process repeats. The second model is now tested against what was originally the second sub-partition of the test set.
This process repeats until there are no more test partitions left.
Some metric (RMSE, for example) is used to evaluate each model's performance against its sub-partition test set, so for each sub-partition there is a corresponding evaluation value. These values are then averaged to give an overall model performance.
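To check my understanding, here is a rough end-to-end sketch of the whole procedure. Simple exponential smoothing with a single smoothing parameter is only a placeholder forecaster, and the hyperparameter grid, split sizes and fold count are arbitrary choices:

```python
# A self-contained sketch of day forward-chaining nested cross-validation,
# with simple exponential smoothing standing in for any tunable forecaster.
import numpy as np

def ses_forecast(history, alpha, horizon):
    """One-parameter simple exponential smoothing; flat forecast over the horizon."""
    level = history[0]
    for obs in history[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.full(horizon, level)

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(actual) - predicted) ** 2))

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))       # toy series

n_train = 140
train, test = y[:n_train], y[n_train:]
test_folds = np.array_split(test, 4)      # sub-test partitions, in temporal order

alphas = np.linspace(0.1, 0.9, 9)         # hyperparameter grid
fold_scores, chosen_alphas = [], []

for fold in test_folds:
    # Inner split: earlier 80% of the current training set vs. later 20%.
    n_sub = int(len(train) * 0.8)
    sub_train, valid = train[:n_sub], train[n_sub:]

    # Tune the hyperparameter on the validation set.
    val_scores = [rmse(valid, ses_forecast(sub_train, a, len(valid))) for a in alphas]
    best_alpha = alphas[int(np.argmin(val_scores))]
    chosen_alphas.append(best_alpha)

    # Retrain on the full training set and evaluate on the next sub-test partition.
    fold_scores.append(rmse(fold, ses_forecast(train, best_alpha, len(fold))))

    # Absorb the evaluated partition into the training set and move forward.
    train = np.concatenate([train, fold])

print("per-fold RMSE:", np.round(fold_scores, 3))
print("chosen alpha per fold:", chosen_alphas)
print("overall (averaged) RMSE:", np.mean(fold_scores))
```

The "overall model performance" I'm asking about is that final averaged RMSE, even though each fold may have settled on a different alpha.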
First of all, have I understood the approach correctly? My issue with this approach is how to interpret the overall model performance value. I don't have any specific problem in mind; I'm just thinking aloud. One is averaging over essentially different models, i.e. each fold's model may end up with a unique set of hyperparameter settings. So is it really fair to call this "the model's" performance? Perhaps I'm getting hung up on the word "model", but one is essentially averaging the scores of different models. I suppose that if the folds end up with wildly different hyperparameters, that is a sign that you should perhaps look at other algorithms better suited to the characteristics of the time series, so maybe this is a non-issue. Does one by default expect little change in the hyperparameter values across folds?
And finally, in production, does this not mean that one needs to periodically retrain the algorithm via day forward-chaining nested cross-validation, ideally at a period equal to the length of the original sub-partition test sets? And presumably continuously re-evaluate it once future data becomes available?