6

I would like to produce a forecast from an ARIMA model with multiple exogenous variables. My time series is monthly unemployment data (in percent) over several years, and my regressors are Wikipedia traffic (viewership) data for several Wikipedia articles. Both the time series and the regressors have the same length.

How do I choose the right regressors to include in the model? Using the auto.arima and forecast functions from the "forecast" package in R, my first attempt was to rank the regressors by the MAE each one achieves when used individually. So I start with only the best regressor (lowest MAE), then add the second best, and so on. However, this post suggests choosing regressors according to significance, while this post by Rob Hyndman suggests using the AIC.

How should I proceed? How do I accept/reject regressors?

ruthy_gg
  • This is quite a frequent question, you could benefit from exploring the existing threads more. Of course, in case of conflicting advice, it is valid to ask for reassurance. – Richard Hardy Jul 28 '16 at 13:05
  • Thanks Richard Hardy, I'm quite new to ARIMA models and to how this R package handles forecasts. I have found several threads; the most helpful one is the one referenced in my post. I just wanted some feedback on my approach. – ruthy_gg Jul 28 '16 at 13:30
  • Understood. See also my comment under Stephan's answer. – Richard Hardy Jul 30 '16 at 16:33

1 Answer

3

The gold standard in time series model selection is to use a holdout sample. Hold out the last few months of data, fit the different models (with different combinations of regressors) to the data before that, forecast into your holdout sample, and pick the model with the lowest forecast error (MAE or MSE).
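A minimal sketch of this holdout procedure with the forecast package might look as follows. The object names (`unemp` for the monthly unemployment series, `views` for a matrix of Wikipedia-traffic regressors) and the candidate regressor sets are hypothetical:

```r
library(forecast)

# 'unemp' is a monthly ts of unemployment rates; 'views' is a numeric
# matrix of Wikipedia traffic with one column per article (same length).
h <- 6                                    # hold out the last 6 months
n <- length(unemp)
train_y <- window(unemp, end = time(unemp)[n - h])
test_y  <- window(unemp, start = time(unemp)[n - h + 1])

candidate_sets <- list(1, 1:2, 1:3)       # columns of 'views' to try

mae <- sapply(candidate_sets, function(cols) {
  fit <- auto.arima(train_y, xreg = views[1:(n - h), cols, drop = FALSE])
  fc  <- forecast(fit, xreg = views[(n - h + 1):n, cols, drop = FALSE])
  mean(abs(test_y - fc$mean))             # MAE on the holdout sample
})

best <- candidate_sets[[which.min(mae)]]  # regressor set with lowest MAE
```

Once the best set is chosen, refit on the full series before producing the final forecast.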

That said, I would expect readership numbers of different Wikipedia articles to be correlated, especially if used as a proxy for "has a lot of time on his hands". So you might want to look at dimension reduction techniques, like principal components analysis (PCA) or similar, to reduce your regressors to only the first few principal components. Fewer orthogonal regressors will yield a more stable model and probably better forecasts. (The problem is that interpretability suffers.)
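One way to set this up, sticking with base R's `prcomp` and hypothetical object names (`views`, `unemp`), is to keep only enough components to cover most of the variance in the regressors:

```r
library(forecast)

# PCA on the (likely correlated) traffic regressors; scaling matters
# because different articles can have very different traffic levels.
pca <- prcomp(views, center = TRUE, scale. = TRUE)

# Keep the first k components explaining, say, 90% of the variance.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k   <- which(var_explained >= 0.9)[1]
pcs <- pca$x[, 1:k, drop = FALSE]

# Feed the orthogonal, low-dimensional components in as xreg.
fit <- auto.arima(unemp, xreg = pcs)
```

The 90% threshold is just one common rule of thumb; a scree plot or holdout performance can also guide the choice of k.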

Stephan Kolassa
  • That was fast :) But I also think we could pick a few good threads on variable/model selection in time series and keep referring to them instead of answering each question anew. Although each situation is slightly different, so your comment on Wikipedia readership is of course valuable. – Richard Hardy Jul 28 '16 at 13:06
  • @RichardHardy: I agree. Then again, searching for the canonical answer on time series model selection takes a slight bit longer than just firing off an answer, so I usually just write the answer... contributing to even more answers and to making it even harder the next time around. I really need to work on my self-control. – Stephan Kolassa Jul 28 '16 at 13:11
  • @StephanKolassa Thanks for your answer. I do choose the best regressor based on the MAE on the holdout data (test set) and then add the second best regressor, etc., based on MAE. Sometimes adding more regressors improves the MAE, sometimes it doesn't. Regarding the PCA suggestion, do you perhaps have a thread or an example where PCA is used with auto.arima and xreg? Thanks – ruthy_gg Jul 28 '16 at 13:41
  • @user844924, you can just look at pure PCA literature (without `auto.arima`), as you would be applying PCA *before* supplying the first few principal components as `xreg` in `auto.arima`. – Richard Hardy Jul 28 '16 at 14:10
  • Thank you!! I have been reading many posts on PCA. My question is whether the order of the components matches the order of the input variables. Which variable does component 1 represent? The first column? – ruthy_gg Jul 28 '16 at 16:37
  • I just found a good thread with the answer I was looking for: http://stats.stackexchange.com/questions/87037/which-variables-explain-which-pca-components-and-vice-versa – ruthy_gg Jul 29 '16 at 09:44
  • Dear @StephanKolassa, I have been playing with PCA using prcomp and then passing the first k components as xreg in auto.arima. My question now is: should I repeat the PCA for both the training and the test set when I want to forecast with auto.arima? In that case the number of components may differ from the training set. Is this correct? – ruthy_gg Jul 29 '16 at 18:17
  • No, it wouldn't make sense to run two PCAs, i.e., transform your future regressors in a different way than the historical ones. Instead, run PCA on the training set, and feed the first few principal components into `auto.arima`. Then transform the regressors you want to use in forecasting *using the same transformation* and feed these into `forecast`. You may need to dig a little into `prcomp` and/or `princomp` to find the matrices that encode this transformation. – Stephan Kolassa Jul 29 '16 at 18:47
  • There is a disadvantage in using PCA only on the regressors as the information content considered is not tailored towards fitting (or predicting) $y$. That is, it can happen that the last principal component is the best predictor for $y$. However, I cannot tell how serious a problem this is in practice. Probably it varies from instance to instance. – Richard Hardy Jul 30 '16 at 16:32
  • @RichardHardy can you please elaborate a bit more about it? – ruthy_gg Jul 31 '16 at 16:59
  • @user844924, this can be found in the literature, but unfortunately I don't remember any references. Probably Hastie et al. "The Elements of Statistical Learning" has a passage about it, but probably not. The idea is, the first principal component extracts as much variation in the data (the $x$s) as possible. However, this variation may be orthogonal to $y$. The last principal component has the least variation, but it might happen to be highly correlated with $y$. So PCA is relevant within the set of $x$s you are considering but not necessarily with respect to $y$. – Richard Hardy Jul 31 '16 at 18:00
  • @StephanKolassa when you say "Then transform the regressors you want to use in forecasting using the same transformation", what do you mean by "using the same transformation"? The PCA gives a result (prcomp$scores) from which I can choose the number of components. Then for testing, the values of the variables differ from the training ones... how do I transform those values in the test set if not by running PCA as well? I'm a bit confused, I'm sorry. – ruthy_gg Jul 31 '16 at 18:03
  • @RichardHardy I see, thanks a lot!!! I understood it better :) – ruthy_gg Jul 31 '16 at 18:05
  • @user844924, regarding your last comment to Stephan, I can try to answer it. You have to use loadings obtained from the original PCA to construct the PCs for the updated/new dataset. So you run PCA once and reuse loadings when needed. – Richard Hardy Jul 31 '16 at 18:34
  • @RichardHardy thanks a lot for your answer. I have just one more question. Coming back to the unemployment time series and the Wikipedia viewership time series: the way I forecast unemployment using ARIMA and one regressor is as follows: take 2 years of monthly unemployment rates, use 2 years of monthly viewership data for one specific article as regressor (xreg), and fit the model on the training set. – ruthy_gg Jul 31 '16 at 19:17
  • @RichardHardy: Using that fit, test the model by predicting the unemployment rate for 5 new months, with the viewership data for those 5 months as xreg. In this case, using PCA with several Wikipedia articles/time series as regressors... should I run the PCA on all the time series (the 2 years + the 5 months) and use the resulting components from that one run for fitting my training set **and** testing my model? – ruthy_gg Jul 31 '16 at 19:17
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/43279/discussion-between-richard-hardy-and-user844924). – Richard Hardy Jul 31 '16 at 19:21
  • By the way, Partial Least Squares (PLS) could be a relevant keyword. It is a bit like PCA but specifically targeted towards Y. If you had just a regression model, it would probably be what you need. In regression with ARMA errors it is not as straightforward. – Richard Hardy Aug 02 '16 at 13:38
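The PCA-reuse workflow discussed in the comments above (fit PCA once on the training-period regressors, then apply the same centering, scaling, and rotation to the forecast-period regressors) can be sketched as follows; the object names (`unemp_train`, `views_train`, `views_new`) are hypothetical:

```r
library(forecast)

# Fit PCA once, on the training-period regressors only.
pca_train <- prcomp(views_train, center = TRUE, scale. = TRUE)
k <- 3                                         # components to keep
pcs_train <- pca_train$x[, 1:k, drop = FALSE]

fit <- auto.arima(unemp_train, xreg = pcs_train)

# Apply the *same* transformation to the forecast-period regressors;
# predict.prcomp reuses the stored centering, scaling, and loadings,
# so do NOT rerun prcomp on the new data.
pcs_new <- predict(pca_train, newdata = views_new)[, 1:k, drop = FALSE]
fc <- forecast(fit, xreg = pcs_new)
```

This guarantees the training and forecasting components live in the same coordinate system, which is what "using the same transformation" refers to.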