Clarifying lag number selection in AR, VAR, VECM etc. models

Question

When selecting the optimal lag length, we are supposed to rely on information criteria such as Akaike's or Schwarz's. As far as I know, each of them suggests the number of lags to include in the model, and those lags are implicitly assumed to be consecutive.

For instance, if we have a VAR(3) in the variables $Y_t$ and $X_t$, we are supposed to include $Y_{t-1}$, $Y_{t-2}$, $Y_{t-3}$ and $X_{t-1}$, $X_{t-2}$, $X_{t-3}$ in the model. And here is my question:

Why should the lagged values always be consecutive? Is it prohibited (by economic or statistical theory) for a model to include non-consecutive lags, as in the example below?

E.g.,

$$ Y_t = a + bY_{t-1} + cY_{t-3} + dY_{t-5} + eX_{t-1} + fX_{t-3} + gX_{t-5} + e_t $$

You should post this under the answer, otherwise Christoph will not get notified of your comment. — Richard Hardy, Dec 14 '18 at 08:53
Thank you for the guidance Richard, since I'm a new member of this community. I still have so much to learn! — Logicseeker, Dec 14 '18 at 09:12

score 1 · Accepted Answer · answered Dec 13 '18 at 14:19

You are right that there is no firm theoretical reason for intermediate lags to always be included.
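To make this concrete, the equation in the question can be read as the $Y$-equation of a VAR(5) with zero restrictions on the intermediate lags. The symbols $\beta_j$ and $\gamma_j$ are notation introduced here for illustration and do not appear in the question:

$$ Y_t = a + \sum_{j=1}^{5} \beta_j Y_{t-j} + \sum_{j=1}^{5} \gamma_j X_{t-j} + e_t, \qquad \beta_2 = \beta_4 = \gamma_2 = \gamma_4 = 0 $$

with $\beta_1 = b$, $\beta_3 = c$, $\beta_5 = d$ and $\gamma_1 = e$, $\gamma_3 = f$, $\gamma_5 = g$. So a model with gaps is just a restricted version of the usual consecutive-lag model.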

My interpretation of why the tradition nevertheless proceeds in this fashion is computational: even for a simple AR model with maximum lag $p_{\max}$, one would need to compare $2^{p_{\max}}$ models if any combination of lags may enter the model, whereas we only need to compare $p_{\max}$ models if the search is performed over the maximal lag alone.

Suppose you entertain lags at "business cycle frequencies" of a few years in quarterly data, so something like 20 lags. Then $2^{20}=1048576$ models need to be compared. While the computational cost of doing so may be less prohibitive nowadays, it is still somewhat of a burden, and the benefits of finding models with missing intermediate lags may be limited.

Of course, there are many more models still to compare in the case of a VAR.
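The counts above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the VAR count assumes that each individual lag coefficient in each equation may be switched on or off independently, which is one possible way to count and not the only one.

```python
from itertools import combinations
from typing import List, Tuple


def consecutive_lag_models(p_max: int) -> int:
    """Number of AR models when only the maximal lag p = 1..p_max is chosen."""
    return p_max


def subset_lag_models(p_max: int) -> int:
    """Number of AR models when any subset of lags 1..p_max may enter.

    Like 2**p_max in the text, this includes the empty subset (no lags).
    """
    return 2 ** p_max


def enumerate_lag_subsets(p_max: int) -> List[Tuple[int, ...]]:
    """Explicitly list every subset of lags; feasible only for small p_max."""
    lags = range(1, p_max + 1)
    return [subset
            for size in range(p_max + 1)
            for subset in combinations(lags, size)]


def var_coefficient_subsets(n_vars: int, p_max: int) -> int:
    """Assumed VAR count: each of the n_vars**2 * p_max lag coefficients
    is included or excluded independently of the others."""
    return 2 ** (n_vars ** 2 * p_max)


print(consecutive_lag_models(20))      # 20
print(subset_lag_models(20))           # 1048576
print(len(enumerate_lag_subsets(3)))   # 8
print(var_coefficient_subsets(2, 3))   # 4096
```

Even the small bivariate VAR(3) from the question already gives 4096 candidates under this counting rule, against 3 candidates when only the maximal lag is varied.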

Christoph Hanck


In addition to the good points in the answer, if you increase the number of models in the comparison, the winner's curse becomes important: there is an increasing probability that the best model is the best in sample (but not in population) mainly due to chance. See Hansen ["A Winners Curse for Econometric Models: ..."](http://www.tse-fr.eu/sites/default/files/medias/stories/SEMIN_10_11/ECONOMETRIE/hansen.pdf) (2010). See also [this thread](https://stats.stackexchange.com/questions/211069/aic-model-selection-and-overfitting?noredirect=1). – Richard Hardy Dec 13 '18 at 14:58
Precise and explicit answer. Thank you very much Christoph! – Logicseeker Dec 14 '18 at 09:06
@Christoph Hanck I came across this today as I'm new to all of this. How does one account for a seasonal lag, as in a situation where you have daily observations but want to test a lag of one calendar quarter (3 months)? Isn't that essentially what OP is asking? Am I misinterpreting what you're both saying, or is it true that measuring the effect of a seasonal lag on data with a finer sampling rate is very expensive? – Sam Dillard Jul 29 '20 at 23:29
It would probably be expensive if you indeed, as we discuss in this post, included all intermediate lags. But if you have good reasons to include only that seasonal lag, you avoid having to fit those intermediate coefficients (and/or a complex model-building exercise). You could also search for seasonal ARIMA models, which do something like that automatically. – Christoph Hanck Jul 30 '20 at 04:18
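The saving described in the last comment can be sketched as follows. This is an illustration only: `lagged_design` is a hypothetical helper, the series is placeholder data, and taking one calendar quarter to be 90 daily lags is an assumption made here for the example.

```python
from typing import List, Sequence, Tuple


def lagged_design(series: Sequence[float],
                  lags: Sequence[int]) -> Tuple[List[float], List[List[float]]]:
    """Return (targets, rows), where each row holds series[t - lag] for the
    requested lags only; lags not listed are never built."""
    max_lag = max(lags)
    targets = [series[t] for t in range(max_lag, len(series))]
    rows = [[series[t - lag] for lag in lags]
            for t in range(max_lag, len(series))]
    return targets, rows


# Placeholder daily series, not real observations.
daily = [float(t) for t in range(200)]

# All intermediate lags up to the assumed quarterly lag of 90 days.
y_full, x_full = lagged_design(daily, list(range(1, 91)))

# Only yesterday's value and the assumed quarterly lag.
y_sparse, x_sparse = lagged_design(daily, [1, 90])

print(len(x_full[0]), len(x_sparse[0]))   # 90 2
print(x_sparse[0])                        # [89.0, 0.0]
```

Both designs use the same 110 target observations, but the sparse one needs 2 slope coefficients instead of 90.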
