
I have seen multiple tutorials [example link] for ARIMA where the p, d, q parameters are selected based on the whole time series. Then, after deciding on the model parameters, they split the data into training and test sets and make predictions on the test set to see how the model performs.

Shouldn't the p, d, q parameters be selected on the training data only, to avoid biasing the performance evaluation on the test set?

MattSt

1 Answer


Yes, of course. ARIMA models are no different from any other model. The workflow is always to first split your data into a training and a testing sample (for time series data, you of course always use the last observations for the test), then fit the model to the training data, then evaluate predictions on the test set. In-sample measures of fit are almost meaningless.
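A minimal sketch of that workflow in R with the forecast package (the AirPassengers data, the split point, and the ARIMA orders are arbitrary choices for illustration, not part of the answer):

```r
library(forecast)

# Chronological split: keep the last two years as the test sample.
y     <- AirPassengers
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))

# Fit on the training sample only (orders picked here purely for illustration).
fit <- Arima(train, order = c(1, 1, 1), seasonal = c(0, 1, 1))

# Evaluate forecasts against the held-out test sample, not the in-sample fit.
fc <- forecast(fit, h = length(test))
accuracy(fc, test)
```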

And of course the model fitting step also includes determining the ARIMA orders, which should therefore be done based on the training data only. Just as in fitting an OLS model, we would determine any transformations or interactions needed based on the training data, not the entire dataset. This is standard practice by (sorry) real forecasters; see any issue of the International Journal of Forecasting.

Incidentally, the procedure outlined in that tutorial for determining the AR and MA orders is iffy. ACF/PACF plots can only be used in this way for "pure" AR(p) or MA(q) models. In any case, one nowadays uses a search over possible models based on information criteria, rather than the earlier Box-Jenkins approach. This is implemented in the forecast and fable packages for R. I recommend Forecasting: Principles and Practice by Hyndman & Athanasopoulos (the 2nd edition works with the forecast package, the 3rd edition with fable).
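For concreteness, a sketch of an information-criterion-based order search that only ever sees the training data, using `forecast::auto.arima()` on the same toy split as above (the data set is again just an assumption):

```r
library(forecast)

y     <- AirPassengers
train <- window(y, end = c(1958, 12))

# The order search (AICc-based by default) is restricted to the training sample.
fit_train <- auto.arima(train)

# For comparison only: the "leaky" variant that searches over the whole series.
# The selected orders can differ, which is exactly the bias the question is about.
fit_full <- auto.arima(y)

fit_train   # inspect the selected (p, d, q)(P, D, Q) orders
fit_full
```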

Stephan Kolassa
  • Thanks for the reply! Do you think it makes sense to still use a Box-Jenkins approach to determine a good range for the search of parameters based on information criteria? – MattSt Oct 05 '21 at 10:54
  • To be honest, I don't think so. As I wrote, you can't use it for mixed ARMA(p,q) models, and `forecast::auto.arima()` will do a better job for the small-ish number of realistic models ([for which there are good reasons](https://stats.stackexchange.com/q/285093/1352)). – Stephan Kolassa Oct 05 '21 at 11:18
  • Shouldn't the information criteria still be calculated based on the likelihood of the model on the test data though, and not on the training data? I am under the impression that the information criteria are usually calculated using the training data. – MattSt Oct 05 '21 at 12:46
  • No, information criteria are always used in model *fitting*, not *evaluation*. At least I have never seen them calculated on test data. – Stephan Kolassa Oct 05 '21 at 12:54
  • @MattSt, see e.g. ["Can AIC be used on out-of-sample data in cross-validation to select a model over another?"](https://stats.stackexchange.com/questions/429126/can-aic-be-used-on-out-of-sample-data-in-cross-validation-to-select-a-model-over/429131#429131) and ["Using AIC/BIC within cross-validation for likelihood based loss functions"](https://stats.stackexchange.com/questions/433605/using-aic-bic-within-cross-validation-for-likelihood-based-loss-functions). – Richard Hardy Oct 05 '21 at 14:49
  • @StephanKolassa: +1 Great answer. Wouldn't we want to determine whether the series is stationary based on the entire dataset and not just the training data? If so, shouldn't the ARMA order then be selected based on the entire dataset? – ColorStatistics Oct 05 '21 at 17:57
  • @ColorStatistics: good question. And I would still say that we want to stick to the training data. After all, why do we do the train/test split? It's to get a better idea of actual predictive performance. If there is nonstationarity that only becomes apparent with the test data (or vice versa), that makes me suspect the series may be prone to such effects in the future, too, so our actual predictive performance will suffer. If we "cheat" during model fitting, we will not be prepared for that. – Stephan Kolassa Oct 06 '21 at 05:02
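Following up on the stationarity discussion in the comments, a minimal sketch of keeping even the differencing/unit-root decision inside the training window, using `ndiffs()`/`nsdiffs()` from the forecast package (the data set is again only an assumption for illustration):

```r
library(forecast)

y     <- AirPassengers
train <- window(y, end = c(1958, 12))

# Decide the (seasonal) differencing orders from the training sample only,
# so the test sample never informs any part of the model specification.
ndiffs(train)    # suggested number of first differences (KPSS test by default)
nsdiffs(train)   # suggested number of seasonal differences
```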