6

Simulated data:

dput(test)
structure(list(xx = c(11.68, 11.29, 11.17, 10.41, 9.36, 9.52, 
8.67, 7.69, 8.36, 6.97, 7.05, 7.08, 6.62, 6.35, 5.96, 4.91, 7.25, 
8.66, 7.85, 9, 8.14, 6.99, 7.23, 6.16, 6.42, 6.6, 5.47, 4.85, 
5.12, 4.76, 4.72, 5.32, 5.04, 5.32, 4.83, 4.83, 4.95, 5.28, 5.53, 
6.01, 6.16, 6.08, 5.81, 5.3, 4.94, 5.24, 4.04, 3.79, 4.62, 3.96, 
3.78, 3.55, 4.06, 3.17, 3.32, 3.26, 2.98, 3.74, 3.51, 3.26, 3.58, 
3, 3.37, 3.83, 4.07, 4.5, 3.88, 3.95, 3.98, 4.66, 4.25, 3.94, 
2.67, 3.35, 3.03, 1.32, 1.51, 1.89, 1.89, 1.88, 2.03, 1.75, 1.58, 
1.9, 2.13, 1.86, 1.07, 0.99, 1.32, 1.04, 1.16, 1.2, 1.1, 1.35, 
1.42, 1.2, 1.23, 1.2, 1.17, 1.06, 0.48, 0.59, 0.54, 0.5, 0.52, 
0.55, 0.51, 0.87, 0.84, 1.12, 1.64), exogenous = c(1000812, 996428, 
992312, 983312, 970940, 971216, 972260, 978320, 976604, 977624, 
984456, 988460, 992740, 1002084, 1012104, 1016452, 1032688, 1050108, 
1064876, 1070644, 1079808, 1079192, 1082396, 1086852, 1088284, 
1094408, 1101852, 1112128, 1130888, 1142100, 1156644, 1167744, 
1182440, 1185032, 1194124, 1212376, 1234436, 1246896, 1267536, 
1288632, 1307456, 1323244, 1338256, 1344260, 1345544, 1347708, 
1345300, 1353236, 1373616, 1380512, 1392532, 1401380, 1411232, 
1408736, 1408912, 1422748, 1433548, 1449596, 1464716, 1474380, 
1481728, 1491148, 1509664, 1528212, 1535496, 1536604, 1537388, 
1546744, 1558392, 1573918, 1580083, 1581735, 1581352, 1587900, 
1603940, 1583744, 1544576, 1527264, 1533664, 1549808, 1569424, 
1576764, 1586548, 1601924, 1617568, 1622520, 1642276, 1650212, 
1652392, 1657760, 1662560, 1664068, 1678948, 1688500, 1703332, 
1721832, 1722728, 1739560, 1748676, 1758956, 1755852, 1750036, 
1760456, 1760356, 1773768, 1765508, 1785276, 1799056, 1814848, 
1840508, NA)), .Names = c("xx", "exogenous"), class = c("data.table", 
"data.frame"), row.names = c(NA, -111L), .internal.selfref = <pointer:   0x0000000000090788>)

I have a data definitely non-stationary, this works fine:

auto.arima(ts(data = test$xx))

Series: ts(data = test$xx) 
 ARIMA(2,1,2) with drift 

But then when I use an exogenous variable and the process is fitting ARIMA-errors, it does not differentiate:

auto.arima(ts(data = test$xx), xreg=test$exogenous, trace=TRUE)

Series: ts(data = test$xx) 
Regression with ARIMA(1,0,0) errors

I turned on trace and realized that it is not even considering differentiation:

ARIMA(2,0,2) with non-zero mean : Inf
ARIMA(0,0,0) with non-zero mean : 341.2597324
ARIMA(1,0,0) with non-zero mean : 209.7431147
ARIMA(0,0,1) with non-zero mean : 259.8316269
ARIMA(0,0,0) with zero mean     : 586.1168037
ARIMA(2,0,0) with non-zero mean : 211.9364634
ARIMA(1,0,1) with non-zero mean : 211.9364098
ARIMA(2,0,1) with non-zero mean : Inf
ARIMA(1,0,0) with zero mean     : Inf

exogenous variable is also non-stationary. What am I missing? for ARIMA-error should we only provide stationary data?

based on Rob.Hyndman, he mentions in the comments:

"This issue has been fixed. There is no need to do any differencing by hand. Just use xreg=xr and everything should work correctly."

soheil
  • 150
  • 10
  • Interesting question. Can you edit your post to include some simulated data? I don't seem to be able to make `auto.arima()` do differencing with `xreg`, either. Have you pinged @RobHyndman about this? It may well be deliberate; take a look at the section on nonstationary data [in this blog post of his](https://robjhyndman.com/hyndsight/arimax/). If you get an answer from him, please consider posting it here for future generations. – Stephan Kolassa Nov 24 '17 at 20:25
  • @StephanKolassa I added data sample. I also followed the actual code of *auto.arima*, I couldn't find the point when *d* becomes zero. I have read the post before, He comments that there is no need for covariates to be stationary. – soheil Nov 24 '17 at 20:45
  • @StephanKolassa how should I ping him? just referring to his stackexchange ID in the question? – soheil Nov 24 '17 at 20:52
  • Send him an email, [address here](https://robjhyndman.com/about/), and point him to this post. If he has time, there is a nonzero chance he'll answer. – Stephan Kolassa Nov 24 '17 at 20:54
  • I think this was discussed either in the help file or in a vignette, just cannot remember the reason given there. – Richard Hardy Nov 24 '17 at 20:58
  • 1
    If you examine the residuals of `lm(xx~exogenous)` you will note there is no indication of nonstationarity; the autocorrelation at a lag of 1 is about 0.4 and declines rapidly thereafter. As a hypothesis - it might be that the function doesn't consider differencing because the ACF / PACF after the regression are are nowhere near the nonstationary region, which would naturally save execution time. Also, note that auto.arima at one time, and perhaps today, applied an ARMA model to the residuals of the regression (https://robjhyndman.com/hyndsight/arimax/) and consequently would not difference. – jbowman Nov 24 '17 at 22:00
  • 1
    As already noted, the residuals are tested for non-stationarity, not the predictors. This is because the predictors may not be stochastic, so then the requirement is only that the residuals are non-stationary. If you know the predictor is stochastic and non-stationarity, then just specify the number of required differences yourself via the d argument. – Rob Hyndman Nov 25 '17 at 02:34

1 Answers1

3

There is a test inside the forecast function for whether the series should be differenced:

if (is.na(d)) {
    d <- ndiffs(dx, test = test, max.d = max.d)
    if (d > 0 & !is.null(xregg)) {
        diffxreg <- diff(diffxreg, differences = d, lag = 1)
        if (any(apply(diffxreg, 2, is.constant))) 
            d <- d - 1
    }
}

where d is the order of differencing specified in the function call (defaulting to NA.) Another test - nsdiffs - is applied for seasonal differencing. If the test does not indicate the presence of a unit root, models with differencing are not even considered, which, as one might imagine, can save considerable runtime.

With respect to the example in the OP - the forecast function runs the regression lm(xx~exogenous) and applies ARIMA modeling to the residuals. In the case of this example, the ACF / PACF plots make it clear that the residuals are stationary, at least to my eye.

To see that auto.arima can in fact consider differenced residuals, we construct the following example where $y$ is clearly nonstationary and, as $x$ and $y$ are independent, the residuals from the regression of $x$ on $y$ will also be nonstationary (unless a very low probability event occurs.)

> y <- rnorm(100, 1:100, 25)
> x <- rnorm(100)
> auto.arima(y, xreg=x, trace=TRUE)

 Regression with ARIMA(2,1,2) errors : Inf
 Regression with ARIMA(0,1,0) errors : 974.5948
 Regression with ARIMA(1,1,0) errors : 953.8159
... more models, removed to save space ...

 ARIMA(2,1,1)                    : 920.2894
 ARIMA(2,1,2)                    : 922.1489
 ARIMA(3,1,2)                    : 923.2468
 ARIMA(1,1,1)                    : 922.377

 Best model: Regression with ARIMA(2,1,1) errors 

Series: y 
Regression with ARIMA(2,1,1) errors 

EDIT: Update in response to a comment

I copied and pasted the data from the example above, and ran:

> length(test$xx)
[1] 111
> length(test$exogenous)
[1] 111
> ndiffs(residuals(lm(xx~exogenous)), max.d=2)
[1] 0

to confirm that the ndiffs function is in fact returning 0 for this data.

jbowman
  • 31,550
  • 8
  • 54
  • 107
  • You have a good point, but I ran the test, it gave me _d=1_, then it differentiates exogenous covariates and last condition is _FALSE_, so _d_ remains 1. I also put `stepwise=FALSE, approximation=FALSE,` same results. – soheil Nov 24 '17 at 22:58
  • When I ran through the code line by line, I got `d = 0` out of the test. Not sure what to say at this point. – jbowman Nov 24 '17 at 23:55
  • I was running on the exogenous, not the residuals. Anyway, your other point was correct. – soheil Nov 25 '17 at 06:59
  • Yes, the code is somewhat unclear. There are two key lines before the `ndiffs` call: `xx[j] – jbowman Nov 25 '17 at 16:07
  • @jbowman, very good point. Now I have an example where BIC is "significantly" (subjective assessment) smaller when `d and D = 1`, while all the unit root tests say `d != 1`. This is on `log(AirPassengers)` data where `xreg` is a trend. Any ideas? Not sure whether it is worth a new question – Davor Josipovic Dec 29 '17 at 08:11