
I have an econometric dataset: 50 observations of 350 variables. They include things like GDP, unemployment, and interest rates, as well as their transformations, such as YoY change, log transform, first differences, etc. I need to build an ARIMAX model, and first I need to select variables.

350 univariate regressions against the response were run, and the 20 best predictor variables based on R-square were chosen.
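For concreteness, a minimal sketch of this screening step (the DataFrame `df` and the response column name `"y"` are placeholder assumptions, not from the question; with an intercept, the univariate R-square equals the squared Pearson correlation):

```python
# Minimal sketch of the univariate screening step described above.
import pandas as pd

def top_k_by_univariate_r2(df: pd.DataFrame, response: str = "y", k: int = 20):
    """Regress the response on each predictor separately and return
    the k predictors with the highest univariate R^2 (= squared
    Pearson correlation when the model includes an intercept)."""
    r2 = {}
    for col in df.columns.drop(response):
        pair = df[[response, col]].dropna()   # pairwise complete cases
        r2[col] = pair[response].corr(pair[col]) ** 2
    return sorted(r2, key=r2.get, reverse=True)[:k]
```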

My question is: is univariate regression a good way to screen predictor variables? I have read that variables perform differently when combined with others than they do alone. Is there anything I need to check about my data before pruning my set of predictor variables this way? (My response variable is a log return, whose mean is close to zero; the transformed predictor variables vary in scale: some are on a log scale, others range in the 100,000s. I expect most of the transformed ones to be stationary.)

Also, I tried running a LASSO selection in SAS with all the variables, and LASSO terminated in just one step, selecting only one variable. There was a message which said that only 5 of the 50 observations were used by LASSO. Could this be due to missing values? My data doesn't have many missing values, so I was surprised. Maybe it's because there are far more predictors than observations (350 vs. 50).
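One possibility worth checking (a hedged illustration on simulated data, not the actual dataset): procedures that do listwise deletion keep only rows with no missing value in *any* column, and with 350 columns even a 1% per-cell missing rate leaves very few complete rows out of 50.

```python
# Illustration: listwise deletion with many columns leaves few usable rows.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, miss_rate = 50, 350, 0.01

X = rng.normal(size=(n_obs, n_vars))
X[rng.random(X.shape) < miss_rate] = np.nan    # sprinkle 1% missing cells

complete_rows = np.all(~np.isnan(X), axis=1).sum()
print(complete_rows)                        # typically only ~1-2 of 50 rows survive
print(n_obs * (1 - miss_rate) ** n_vars)    # expected count: 50 * 0.99^350 ~ 1.5
```

Whether SAS's LASSO routine actually performs listwise deletion here is an assumption to verify against its log.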

Thanks for any advice on how to proceed.

user2450223
  • @Hamed, would you mind taking a look at [this question](http://stats.stackexchange.com/questions/152202/regularization-for-arima-models)? It seems related to the linked research paper. – Richard Hardy Aug 06 '15 at 11:26
  • *350 univariate regressions against the response were run, and the 20 best predictor variables based on R-square were chosen.* No, that does not sound sensible. Examining bivariate relationships is not enough to decide which explanatory variables are the most relevant. LASSO is much more suitable (as long as you deal with the error SAS is giving there). However, there remains the problem that some variables may have a delayed response, as mentioned by @IrishStat. Perhaps try LASSO on the variables and their first and second lags (including more lags would perhaps be overkill)? – Richard Hardy Aug 06 '15 at 11:33
  • Thanks for the answers. As I mentioned, LASSO only uses 5 out of the 50 records. I don't have that many missing values... so I'm confused. I am actually getting apparently good p-values with the models I built using the method above... but are these p-values suspect in any way? These models are to be used primarily for prediction. I have also posted about this problem here: http://stats.stackexchange.com/questions/164999/variable-selection-for-arimax-model – user2450223 Aug 06 '15 at 16:21
  • @Richard Hardy Also, why would examining bivariate relationships not be a good starting point? I also used variable clustering, as mentioned here: http://stats.stackexchange.com/questions/164999/variable-selection-for-arimax-model – user2450223 Aug 06 '15 at 16:31
  • Suppose the true model is $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\varepsilon$ but you fit $y=\beta_0+\tilde \beta_1 x_1+\tilde \varepsilon_1$ and $y=\beta_0+\tilde \beta_2 x_2+\tilde \varepsilon_2$. The $R^2$s, the coefficients and their statistical significances in the bivariate models can be very far away from those of the full model. You need not expect that the bivariate regressions will be informative of the full model. This result has been demonstrated in other posts here on Cross Validated. – Richard Hardy Aug 06 '15 at 17:08
  • An example with omitted intercept (which is simpler than the case with an intercept): $y=(1,1,1,1,\dots,1,1)$, $x_1=(0,1,0,1,\dots,0,1)$, $x_2=(1,0,1,0,\dots,1,0)$. The bivariate regressions would give a terrible fit while the full model would give a perfect fit. If you trusted the bivariate regressions, you would dismiss both $x_1$ and $x_2$ as potential candidates for relevant regressors. However, that would be a very bad decision in light of what the full model gives you. – Richard Hardy Aug 06 '15 at 17:14
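For concreteness, a numerical check of this example (a hedged sketch; the uncentered $R^2$ is used because the models have no intercept):

```python
# Each bivariate no-intercept regression fits poorly;
# the full no-intercept model y = x1 + x2 fits perfectly.
import numpy as np

n = 10
y  = np.ones(n)
x1 = np.tile([0.0, 1.0], n // 2)   # (0, 1, 0, 1, ...)
x2 = 1.0 - x1                      # (1, 0, 1, 0, ...)

def r2_no_intercept(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / (y @ y)   # uncentered R^2 (no intercept)

print(r2_no_intercept(x1[:, None], y))                # 0.5  (poor)
print(r2_no_intercept(x2[:, None], y))                # 0.5  (poor)
print(r2_no_intercept(np.column_stack([x1, x2]), y))  # 1.0  (perfect)
```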

1 Answer


Your approach fails to consider various forms of delayed response to one or more of the candidate predictors. When determining the appropriate subset of variables, you need to pre-whiten the variables and form impulse response weights to identify the important lags of each candidate, while taking into account possible intervention variables like pulses, level shifts, etc. We refer to this problem as kitchen-sink modelling, as you are throwing everything into the mix except the kitchen sink.
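As a rough illustration of the prewhitening step (a minimal sketch, assuming a single candidate `x` and response `y` as 1-D numpy arrays; the AR order and lag window are arbitrary illustrative choices, not prescriptions from this answer):

```python
# Sketch: prewhiten x with a fitted AR filter, apply the same filter
# to y, and inspect cross-correlations to find delayed responses.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def prewhitened_ccf(x, y, ar_order=2, max_lag=6):
    """Return cross-correlations of prewhitened x and y
    at lags 0..max_lag (x leading y by k periods)."""
    phi = AutoReg(x, lags=ar_order, trend="c").fit().params[1:]  # AR coefficients

    def ar_filter(z):
        # z[t] - phi_1 * z[t-1] - ... - phi_p * z[t-p]
        out = np.asarray(z, dtype=float)[ar_order:].copy()
        for i, p_i in enumerate(phi, start=1):
            out -= p_i * z[ar_order - i : len(z) - i]
        return out

    xw, yw = ar_filter(x), ar_filter(y)
    return np.array([
        np.corrcoef(xw[: len(xw) - k], yw[k:])[0, 1]
        for k in range(max_lag + 1)
    ])
```

A large correlation at some lag $k>0$ is the kind of delayed response described above, which a purely contemporaneous screen would miss.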

IrishStat
  • Thanks for your reply. I am new to time series modeling and am not sure what a delayed response is... do you mean the lags of the predictor variables? I added a variable clustering step, too, to finalize the variable selection. I have also posted about this problem here: http://stats.stackexchange.com/questions/164999/variable-selection-for-arimax-model – user2450223 Aug 06 '15 at 16:33
  • Yes, that is what I mean. Consider a candidate X having a 3-period lag effect on Y. Your scheme would not necessarily identify candidate X. – IrishStat Aug 06 '15 at 18:27
  • The original 350 include lags as well as other transformations of the raw variables. Could you suggest any other ways to choose features for this problem? Thanks. – user2450223 Aug 07 '15 at 01:16
  • What you are trying to do is a list-based solution: preparing various transformations in advance. A better approach (which is what I programmed in AUTOBOX) is to take each distinct (original candidate) X variable and assess its importance for predicting/modeling the Y variable, taking into account potential contemporaneous and lag effects while incorporating any Gaussian violations such as pulses, level shifts, seasonal pulses, and/or local time trends. Then rank these predictors and select an estimable subset based on the length of your data. – IrishStat Aug 07 '15 at 11:28
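To make the ranking idea in the last comment concrete, here is a crude sketch (emphatically not AUTOBOX's actual algorithm; the pulse/level-shift/seasonal-pulse handling it mentions is omitted): score each candidate by how well its contemporaneous value and a couple of lags explain Y, then rank.

```python
# Crude approximation of the ranking idea: score each candidate X by the
# AIC of an OLS of y on x_t, x_{t-1}, ..., x_{t-max_lag} (lower is better).
# Intervention/outlier handling from the comment above is omitted.
import numpy as np
import statsmodels.api as sm

def candidate_score(y, x, max_lag=2):
    """AIC of OLS of y on the contemporaneous value and max_lag lags of x."""
    n = len(y)
    lags = np.column_stack([x[max_lag - k : n - k] for k in range(max_lag + 1)])
    return sm.OLS(y[max_lag:], sm.add_constant(lags)).fit().aic

# Usage (hypothetical): `candidates` is a dict of name -> 1-D array.
# ranked = sorted(candidates, key=lambda name: candidate_score(y, candidates[name]))
# Keep only a subset small enough to estimate from 50 observations.
```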