Modelling a time-series with lags

Question

I have a data set with 200 predictors and 700 observations. It is a regular time series, so 700 days in my case.

I want to experiment with lagged variables, which I will create manually and save as new columns (predictors) in my data.table. This is because I am going to use boosting methods for the modelling, nothing like ARIMA.

Boosting works sequentially and furthermore, I am using component-wise boosting, which (loosely) replies "no" to the question: "isn't adding more and more lags and so variables damaging to the model accuracy?"

I want to test which numbers of lags is optimal, or rather at which point it doesn't make any difference to the model. I am going to be running the whole thing through caret, using the train function to cross-validate and optimise the main parameters I need to minimise my loss function.

My question is: Can I just run the model with, say 5 lags. something like [leaving out coefs etc.]:

y(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4) + x(t-5)

Will this cover lags as a factor in the model optimally? Or would there be significant differences to running the lags incrementally, that is in five stages:

y0(t) ~ x(t)
y1(t) ~ x(t) + x(t-1)
y2(t) ~ x(t) + x(t-1) + x(t-2)
y3(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3)
y4(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4)
y5(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4) + x(t-5)

What might I google to read more about this idea? I understand there are tests like the Granger causality test, however I am not sure how well that performs with so many covariates...?

Tim · Answer 1 · 2015-12-10T18:06:32.403

0

The problem that you are describing is called autocorrelation. You can actually check the level of autocorrelation for any number of lags using autocorrelation function as described here. The model that you are describing is called autoregressive model (the AR part of ARMA). The simplest case is model for lag $h = 1$, the AR(1) model is

$$ y_t = \phi y_{t-1} + \varepsilon_t $$

AR model for $p$ lags, or AR(p), is

$$ y_t = \sum_{h=1}^p \phi_h y_{t-h} + \varepsilon_t $$

Notice that if you do it consistently, it does not really make much difference if you use $+1$ or $-1$ lag because they are symmetric. You can estimate such model using standard regression and such estimate should be close to AR, but most statistical software enables you to estimate it using functions designed for such models (e.g. ar in R).

edited Dec 10 '15 at 18:06

answered Dec 10 '15 at 09:00

Tim

108,699
20
212
390

I understand what autocorrelation is, as well as the multivariate euivalent (in my eyes) that is multicollinearity, however I would like to knowif there is a known methodology to test for this among wide data sets. As in my example, I want to know (by some 'established' criterion, if possible) which lags are useful and which not. If I compute pair-wise corelation for each variable and all lags (that 160 * 5 Lags = 320k possible pairs!), can I then say any value above `X` means it isn't good to use that variable-lag? Thanks for the answer, apologies for my slow response! – n1k31t4 Dec 18 '15 at 14:29
@DexterMorgan check the second link about autocorrelation that I provided - does it help? Unfortunately there is no clear cutting points for autocorrelation, it depends on your data ()generally such cutting points are misleading, cf. http://stats.stackexchange.com/questions/132536/how-to-choose-a-confidence-level/132538#132538. – Tim Dec 18 '15 at 14:37

Modelling a time-series with lags

1 Answers1