Let's say one wants to fit a model to a daily financial time series for prediction (e.g. ARIMA, SVM). If the data are stationary, then ideally the longer the time series, the better. In practice, I don't feel comfortable blindly trusting stationarity tests (e.g. KPSS, ADF). For example, at the 90% confidence level both KPSS and ADF confirm that the following time series is stationary, even though it qualitatively doesn't seem to be homoscedastic.
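For reference, both tests are available in the tseries package. A minimal sketch on placeholder data (note that the two tests have opposite null hypotheses, so "confirming stationarity" means ADF rejects its unit-root null while KPSS fails to reject its stationarity null):

```r
# Minimal sketch; the data are a hypothetical stand-in for the daily series.
library(tseries)

set.seed(1)
x <- rnorm(1000)  # placeholder: substitute your own series

adf.test(x)   # H0: unit root; a small p-value suggests stationarity
kpss.test(x)  # H0: level stationarity; a small p-value suggests non-stationarity
```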
Which quantitative methods exist to identify a reasonable starting date for the time series in terms of prediction quality (i.e. minimum test error, low variance of the prediction)? Please refer to R packages when possible.
My attempts:
(i) A brute-force approach could consist of repeating the fit for every candidate length of the time series (e.g. 1y, 1y+1d, ..., 5y) and keeping the length with the lowest test error; a rough sketch is given after point (ii). However, this approach is far too expensive at daily resolution.
(ii) Apply stationarity tests (ADF, KPSS) to the time series at the minimum allowed length and extend the length until the tests reject stationarity (sketched below as well). The problems with this approach are multiple: (a) it is extremely dependent on the chosen confidence level of the test (e.g. 95% or 80%); (b) stationarity tests are unable to identify the regime changes that may occur over long financial time series.
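To illustrate (i), here is a coarse sketch assuming auto.arima() from the forecast package as the model and a fixed hold-out set; the grid of lengths, the horizon, and the data are all hypothetical placeholders:

```r
library(forecast)

set.seed(1)
x <- cumsum(rnorm(1500))                      # placeholder for the daily series

n_test  <- 60                                 # fixed hold-out of the last 60 days
train   <- head(x, -n_test)
test    <- tail(x,  n_test)
lengths <- seq(250, length(train), by = 250)  # coarse grid: ~1y, ~2y, ... of trading days

rmse <- sapply(lengths, function(L) {
  fit <- auto.arima(tail(train, L))           # refit on the last L observations only
  sqrt(mean((forecast(fit, h = n_test)$mean - test)^2))
})
lengths[which.min(rmse)]                      # training length with the lowest test RMSE
```

A fine-grained grid (1y, 1y+1d, ...) would mean thousands of refits, which is what makes the approach expensive.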
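And a sketch of (ii) using tseries::kpss.test; the step size, minimum length, and significance level alpha are arbitrary choices, which is exactly problem (a):

```r
library(tseries)

set.seed(1)
x <- rnorm(1500)        # placeholder for the daily series

alpha <- 0.05           # the outcome is very sensitive to this choice
step  <- 250            # extend by ~1 trading year at a time
L     <- 250            # minimum allowed length

while (L + step <= length(x)) {
  p <- suppressWarnings(kpss.test(tail(x, L + step))$p.value)
  if (p < alpha) break  # KPSS rejects stationarity: stop extending the window
  L <- L + step
}
L                       # longest recent window still deemed stationary
```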
A closely related topic, though it doesn't provide automatic/quantitative procedures: Length of Time-Series for Forecasting Modeling
EDIT (2/Jul/2016): After further thought, I think an optimal approach could be to follow the principle "the larger the dataset, the better". After all, I guess a model whose performance depends heavily on the length of the time series could be considered a "bad" model. Rather than focusing on selecting an optimal length, one could focus on identifying features that work well under different regimes of the time series.