How can I properly do variable and model selection using auto.arima function?

Question

I'm fitting ARIMA models to two different data sets (different metrics of fish abundance and distribution from two different sites) to see which model orders and covariates best describe the data from each site and would be good to forecast.

To do so, I'm using the auto.arima function. I'm running auto.arima with different combinations of covariates and looking at the AICc. I fixed d=1 so that the differenced input data are always the same, which makes the AICc values comparable across models.

The orders of the ARIMA output are typically different depending on the covariate(s) I include. Am I doing this right? Or should I fix the orders p, d and q of the ARIMA and then evaluate the different combinations of covariates?

Or am I totally wrong and I should just run auto.arima() with all the possible covariates in xreg and see what comes out? I tried this and I got a coefficient for each variable but I'm not sure if that means all variables are important or if auto.arima is forcing the variables to be included in the final model.

score 5 · Accepted Answer · answered May 10 '17 at 08:03

auto.arima()'s help page does not say whether or how xreg regressors are selected. I strongly suspect that xreg regressors are always included, regardless of whether they improve AICc or fit, or are significant. Running

    auto.arima(rnorm(100), xreg = rnorm(100))

multiple times confirms this - the regressor is always included, even if it is, like here, utterly unrelated to the time series.
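The repeated check can be scripted instead of run by hand. This is a minimal sketch, assuming the forecast package is installed; the column name `noise`, the seed, and the number of runs are illustrative choices, not part of the original answer.

```r
library(forecast)

set.seed(1)                 # arbitrary seed, only for reproducibility
n_runs <- 20                # illustrative number of repetitions
has_xreg <- logical(n_runs)

for (i in seq_len(n_runs)) {
  y <- rnorm(100)
  # One pure-noise regressor, named so it can be found among the coefficients
  x <- matrix(rnorm(100), ncol = 1, dimnames = list(NULL, "noise"))
  fit <- auto.arima(y, xreg = x)
  has_xreg[i] <- "noise" %in% names(coef(fit))
}

# Share of runs in which the unrelated regressor was kept in the model
mean(has_xreg)
```

If the regressor is indeed never dropped, the reported share should be 1, even though `noise` carries no information about `y`.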

It's not surprising that your ARIMA model changes if you include different regressors. After all, auto.arima() fits a regression with ARIMA errors (note: this is not an ARIMAX model!), so if you include different regressors, you feed a different residual time series to the ARIMA model. So getting different ARIMA orders makes perfect sense.
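To make the distinction concrete, here is a sketch for a single regressor $x_t$ and ARMA(1,1) dynamics (the orders are chosen only for illustration). In a regression with ARMA errors, the dynamics live in the error term $\eta_t$:

$$
y_t = \beta x_t + \eta_t, \qquad \eta_t = \phi\,\eta_{t-1} + \varepsilon_t + \theta\,\varepsilon_{t-1}.
$$

In an ARIMAX model, the lagged response itself appears on the right-hand side:

$$
y_t = \beta x_t + \phi\,y_{t-1} + \varepsilon_t + \theta\,\varepsilon_{t-1}.
$$

In the first form, $\beta$ keeps its usual regression interpretation as the effect of $x_t$ on $y_t$. In the second it does not, because the effect of $x_t$ is conditional on $y_{t-1}$.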

I'd propose that you actually regress your time series on the regressors, then look at the time series of residuals - this is what is fitted using ARIMA. Look also at ACF/PACF plots. Do the fitted ARIMA orders make sense for the residual time series? Do they make biological sense? For instance, if your model is well-specified, there may not be a good reason for moving average terms.
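The residual check suggested above could look like the following sketch. The data are simulated and the names `x1`, `x2`, and `dat` are hypothetical stand-ins for your own series and covariates. Only base R is used.

```r
set.seed(42)                # arbitrary seed for the simulated example
n <- 200

# Hypothetical covariates and a response with AR(1) errors
x1 <- rnorm(n)
x2 <- rnorm(n)
e  <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
dat <- data.frame(y = 2 * x1 + e, x1 = x1, x2 = x2)

# Step 1: ordinary regression of the series on the regressors
ols <- lm(y ~ x1 + x2, data = dat)

# Step 2: treat the residuals as a time series and inspect them
res <- ts(residuals(ols))
acf(res)                    # slow decay would suggest AR structure
pacf(res)                   # a cutoff after lag p would suggest AR(p)
```

This two-step view is a diagnostic only: auto.arima() estimates the regression coefficients and the ARIMA error structure jointly, so its residual series need not match the OLS residuals exactly.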

So selecting ARIMA orders based on AICc makes sense, but you will need to do something else to select your regressors. This will depend on what you are actually interested in. If you are interested in inference, then you should have prespecified your model and not be fitting different models, anyway. Stepwise methods for selecting covariates, whether based on AIC or significance, will render p-values invalid.

If you have enough data and are interested in prediction, you could do a forecasting test. Hold out the last couple of data points. Run auto.arima() on the rest, using your subset of regressors. Then forecast into the holdout sample. Do this for different choices of regressor sets, and check which one gives the lowest mean squared error. It is said that "the proof of the pudding is in the eating", and I firmly believe that "the proof of the model is in the prediction".
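A holdout comparison along these lines might be sketched as follows, assuming the forecast package is installed. The data are simulated, the candidate regressor sets and the holdout length `h` are illustrative, and the sketch assumes the regressor values are known for the holdout period.

```r
library(forecast)

set.seed(123)               # arbitrary seed for the simulated example
n <- 120
h <- 20                     # illustrative holdout length

# Hypothetical regressors; only x1 actually drives the series
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- 1.5 * X[, "x1"] + as.numeric(arima.sim(model = list(ar = 0.5), n = n))

train <- 1:(n - h)
test  <- (n - h + 1):n

# Candidate regressor sets to compare
candidates <- list(x1 = "x1", x2 = "x2", both = c("x1", "x2"))

holdout_mse <- sapply(candidates, function(cols) {
  # Fit on the training part only, with the chosen regressors
  fit <- auto.arima(y[train], xreg = X[train, cols, drop = FALSE])
  # Forecast the holdout period using the holdout regressor values
  fc <- forecast(fit, xreg = X[test, cols, drop = FALSE])
  mean((y[test] - as.numeric(fc$mean))^2)
})

holdout_mse                       # mean squared error per regressor set
names(which.min(holdout_mse))     # set with the lowest holdout MSE
```

With a single split the ranking can be noisy. Repeating the exercise over several forecast origins would give a more stable comparison.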

`auto.arima` certainly uses all the regressors supplied in `xreg`, there is no selection within `xreg` built into `auto.arima` -- I think you can state that unconditionally and shorten the first six lines. Regarding the last paragraph, some say that sample splitting for model selection is an inefficient alternative to selection based on full sample via information criteria. But perhaps that relies on assumptions of a constant data generating process over time, and in practice I am sympathetic to the idea of cross validation for model selection. — Richard Hardy, May 10 '17 at 09:03
@RichardHardy: thanks! I'd honestly rather leave the first paragraph as it is - it seems to me like it does add information, especially since this info does not seem to be spelled out in the help page. Regarding IC vs. holdout/crossvalidation - both have pros and cons, and ICs are typically only known to work *asymptotically*, and if we have "lots of data" in a time series context, I'd in fact worry that the DGP has changed, so I prefer holdouts. Plus, it's easier to understand. — Stephan Kolassa, May 10 '17 at 09:50
@StephanKolassa: thanks! I'll try: (1) using regression models (or maybe GAMs?) to select the important covariates by looking at their p-values, and then plotting the residuals to see if there is some autocorrelation remaining that I could explain using an ARMA structure - if that's the case I'll run auto.arima() on the residual time series to come up with a complete ARIMA-reg model that best describes my dataset. (2) doing what you suggested in your last paragraph. For (2) I have 362 datapoints; what do you think is a proper number of datapoints to hold out? Does (1) sound reasonable to you? — Sil, May 10 '17 at 20:53
(1) is fine if your goal is prediction. As I write, if you are interested in inference, you should not be selecting your model based on p-values, because this data-picking will bias your *final* p-values downward. (2) I usually go with anything between 10% and 25% of the data. It also depends on what time scale your data are on, since I'd expect your data to exhibit seasonality - it would be good to have at least two years in the training and if possible one year in the test data. — Stephan Kolassa, May 11 '17 at 07:29
