I have a data set with 200 predictors and 700 observations. It is a regular time series, so 700 days in my case.
I want to experiment with lagged variables, which I will create manually and save as new columns (predictors) in my data.table, since I am going to use boosting methods for the modelling rather than anything like ARIMA.
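For concreteness, here is roughly how I am building the lags with data.table::shift (the columns y and x are just stand-ins for my real response and one of the 200 predictors):

```r
library(data.table)

# one row per day; x stands in for one of the 200 predictors
dt <- data.table(day = 1:700, y = rnorm(700), x = rnorm(700))

# add lag-1 to lag-5 versions of x as new columns
lag_cols <- paste0("x_lag", 1:5)
dt[, (lag_cols) := shift(x, n = 1:5, type = "lag")]

# the first 5 rows now have NAs in the lagged columns, so drop them
dt <- na.omit(dt)
```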
Boosting works sequentially, and furthermore I am using component-wise boosting, which (loosely speaking) answers "no" to the question: "isn't adding more and more lags, and therefore more variables, damaging to model accuracy?"
I want to test which number of lags is optimal, or rather at which point adding more lags no longer makes any difference to the model.
I am going to be running the whole thing through caret, using the train function to cross-validate and optimise the main parameters I need to minimise my loss function.
My question is: can I just run the model with, say, 5 lags, something like this [leaving out coefficients etc.]:
y(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4) + x(t-5)
Will this cover the lags optimally within the model? Or would there be significant differences compared with adding the lags incrementally, one at a time (I sketch this in code after the list below):
y0(t) ~ x(t)
y1(t) ~ x(t) + x(t-1)
y2(t) ~ x(t) + x(t-1) + x(t-2)
y3(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3)
y4(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4)
y5(t) ~ x(t) + x(t-1) + x(t-2) + x(t-3) + x(t-4) + x(t-5)
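If the incremental route is the better one, I imagine I would loop over the number of lags and compare the resampled error, something like this sketch (reusing ctrl from above; the lag column names are the hypothetical ones from my first snippet):

```r
results <- lapply(0:5, function(k) {
  # keep y, x, and the first k lag columns (x_lag1 ... x_lag5)
  keep <- c("y", "x", if (k > 0) paste0("x_lag", seq_len(k)))
  fit_k <- train(y ~ ., data = dt[, ..keep],
                 method = "glmboost",
                 trControl = ctrl,
                 tuneGrid = expand.grid(mstop = c(50, 100, 200),
                                        prune = "no"),
                 metric = "RMSE")
  # record the best resampled RMSE for this number of lags
  data.frame(lags = k, RMSE = min(fit_k$results$RMSE))
})
do.call(rbind, results)
```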
What might I google to read more about this idea? I understand there are tests like the Granger causality test, but I am not sure how well that performs with so many covariates?
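(For reference, the bivariate version of the test is available as grangertest in the lmtest package, but as far as I can tell it only considers one predictor series at a time:)

```r
library(lmtest)

# bivariate Granger test: do up to 5 lags of x help predict y?
grangertest(y ~ x, order = 5, data = dt)
```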