
I am a novice to time series forecasting and I need some help understanding something in Rob Hyndman's excellent Forecasting: Principles and Practice book (3rd edition). After fitting a regression model with ARIMA errors (section 10.3, figure 10.7 for the actual example I am referring to), the book says

"There is clear heteroskedasticity in the residuals, with higher variance in January and February, and lower variance in May. The model also has some significant autocorrelation in the residuals, and the histogram of the residuals shows long tails. All of these issues with the residuals may affect the coverage of the prediction intervals, but the point forecasts should still be ok."

(Bold is mine)

Given that it is important (or a requirement?) for a forecasting model to have residuals with zero mean and no autocorrelation (mentioned in section 5.4 of the same book), it appears to me that the example in figure 10.7 violates the requirement that there be no correlation in the residuals of a good forecasting model.

Why does the book then say the point forecasts should still be ok? Does this suggest autocorrelation in residuals will not affect point forecasts? What properties of the residuals affect point forecasts?

Part 2 is here

Newwone

2 Answers


I would say the quoted statement is ambiguous and possibly misleading. Heteroskedasticity does not affect the point forecasts, but serial correlation would make them invalid.

In general, the forecast implications of residual diagnostics are as follows:

  1. **No heteroskedasticity and no serial correlation:** Forecasts can be computed using consistent parameter estimates, and forecast/prediction intervals have the right coverage probability.

  2. **Heteroskedastic but no serial correlation:** Forecasts can be computed using consistent parameter estimates. Forecast/prediction intervals would have the right coverage probability if the sample size is large or if robust standard errors are used.

  3. **Serially correlated:** Parameter estimates are no longer consistent. Forecasts and prediction intervals cannot be computed.

For example, take the simplest time series data generating process, the AR(1) model $$ x_t = \rho x_{t-1} + \epsilon_t, $$ and consider the following 3 cases.

Case 1: $\epsilon_t \stackrel{i.i.d.}{\sim} (0, \sigma^2)$

This is the ideal scenario. The residuals from fitting the AR(1) model to a sample would not have serial correlation, heteroskedasticity, or thick tails, because the population error term $\epsilon_t$ does not.

The oracle one-period ahead forecast and mean-square forecast error (MSFE) are \begin{align} E[x_{t+1}|x_t] &= \rho x_t,\\ E[ (x_{t+1} - E[x_{t+1}|x_t])^2 ]&= \sigma^2. \end{align}

So to compute the one-period-ahead forecast based on a sample of size $T$, you simply replace $\rho$ by, say, the OLS/conditional MLE estimate $\hat{\rho}$: $$ x_{T+1 \vert T} = \hat{\rho} x_T. $$ Same for the forecast mean square error $$ \widehat{MSFE}^2 = \frac{1}{T} \hat{\sigma}^2 + \hat{\sigma}^2, $$ where $\hat{\sigma}^2$ is the usual sum of squared residuals divided by $T-1$. The 95% prediction interval is then $x_{T+1 \vert T} \pm 1.96 \times \widehat{MSFE}$. The coverage probability of this prediction interval approaches the nominal coverage probability of 95% in large samples.

($\widehat{MSFE}$ can be computed as follows: \begin{align} \widehat{MSFE}^2 &= E[ (x_{T+1} - \hat{\rho} x_T)^2] \\ &= E[(\hat{\rho} - \rho)^2 x_T^2] + \sigma^2 \\ &\approx \frac{1}{T} \hat{\sigma}^2 + \hat{\sigma}^2. \end{align} In comparison with the oracle MSFE, the first term accounts for the estimation error $\hat{\rho} - \rho$.)
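
For concreteness, here is a minimal simulation sketch of Case 1 in Python, assuming only `numpy`; the seed and the values of $T$, $\rho$, and $\sigma$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, rho, sigma = 500, 0.7, 1.0  # arbitrary illustrative values

# Simulate an AR(1) with i.i.d. Gaussian errors (Case 1)
x = np.zeros(T + 1)
for t in range(1, T + 1):
    x[t] = rho * x[t - 1] + sigma * rng.standard_normal()

# OLS / conditional-MLE estimate of rho: regress x_t on x_{t-1}
y, z = x[1:], x[:-1]
rho_hat = (z @ y) / (z @ z)
resid = y - rho_hat * z
sigma2_hat = (resid @ resid) / (T - 1)

# One-period-ahead forecast and 95% prediction interval
forecast = rho_hat * x[-1]
msfe = np.sqrt(sigma2_hat / T + sigma2_hat)  # estimation error + error variance
print(f"rho_hat = {rho_hat:.3f}")
print(f"forecast = {forecast:.3f}, 95% PI = ({forecast - 1.96 * msfe:.3f}, "
      f"{forecast + 1.96 * msfe:.3f})")
```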

Case 2: $(\epsilon_t)$ is (conditionally) heteroskedastic but serially uncorrelated

(For example, $( \epsilon_t )$ could follow an ARCH process. The consistency of $\hat{\rho}$ holds beyond such parametric specifications.)

The residuals from fitting the AR(1) model to a sample would show heteroskedasticity but no serial correlation. The estimate $\hat{\rho}$ is still consistent, and the one-period ahead forecast is still $\hat{\rho} x_T$. A prediction interval of the form $\hat{\rho} x_T \pm \cdots$ would still be correctly centered.

For the mean square forecast error, $$ E[(\hat{\rho} - \rho)^2 x_T^2] \approx \frac{1}{T} \hat{\sigma}^2 $$ is no longer a good approximation. $\hat{\sigma}$ should be replaced by a heteroskedasticity-robust standard error. However, if $T$ is large, this term is negligible, and $$ \hat{\rho} x_T \pm 1.96 \times \hat{\sigma} $$ would still have an asymptotic coverage probability of 95%.
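
A sketch of Case 2 along the same lines, with the errors following an ARCH(1) process whose parameters are arbitrary illustrative choices; $\hat{\rho}$ stays close to the true $\rho$ despite the heteroskedasticity:

```python
import numpy as np

rng = np.random.default_rng(1)
T, rho = 20_000, 0.7
a0, a1 = 0.5, 0.4  # ARCH(1) parameters, arbitrary illustrative values

# Simulate an AR(1) whose errors follow an ARCH(1) process:
# eps_t = sqrt(a0 + a1 * eps_{t-1}^2) * u_t,  u_t i.i.d. N(0, 1)
x = np.zeros(T + 1)
eps_prev = 0.0
for t in range(1, T + 1):
    eps = np.sqrt(a0 + a1 * eps_prev**2) * rng.standard_normal()
    x[t] = rho * x[t - 1] + eps
    eps_prev = eps

# The errors are serially uncorrelated, so OLS remains consistent
y, z = x[1:], x[:-1]
rho_hat = (z @ y) / (z @ z)
print(f"rho_hat = {rho_hat:.3f} (true rho = {rho})")  # close to 0.7
```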

Case 3: $(\epsilon_t)$ is serially correlated

(For example, $( \epsilon_t )$ could be itself AR(1).)

The residuals from fitting the AR(1) model to a sample would have serial correlation. The estimate $\hat{\rho}$ is no longer consistent (you can check this via a simple simulation; see the sketch below) and $\hat{\rho} x_T$ is no longer a consistent estimator of $E[x_{T+1}|x_T]$.

The minimal condition required for $\hat{\rho}$ to be consistent is $\frac{1}{T} \sum_{t=1}^T E[x_{t-1} \epsilon_t] \rightarrow 0$, i.e., the regressor $x_{t-1}$ must be uncorrelated with the error on average. This would not be satisfied if $(\epsilon_t)$ has serial correlation.
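
The simulation mentioned above might look like this (again a sketch with arbitrary illustrative parameter values); $\hat{\rho}$ settles well away from the true $\rho$:

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho, phi = 100_000, 0.7, 0.5  # phi: AR(1) coefficient of the errors

# Simulate x_t = rho * x_{t-1} + eps_t, where eps_t is itself AR(1):
# eps_t = phi * eps_{t-1} + u_t,  u_t i.i.d. N(0, 1)
x = np.zeros(T + 1)
eps = 0.0
for t in range(1, T + 1):
    eps = phi * eps + rng.standard_normal()
    x[t] = rho * x[t - 1] + eps

y, z = x[1:], x[:-1]
rho_hat = (z @ y) / (z @ z)
print(f"rho_hat = {rho_hat:.3f}, true rho = {rho}")
# rho_hat converges to Cov(x_{t+1}, x_t)/Var(x_t), not to rho; with these
# parameters the probability limit is (rho + phi)/(1 + rho * phi) ~ 0.889
```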

Caveat: Best Forecast vs. Best Linear Forecast

Forecasting can be discussed in terms of the best forecast $E[x_{T+1}|x_T]$ (the conditional mean of $x_{T+1}$ given $x_T$), or in terms of the best linear forecast. The discussion above concerns the best forecast.

In terms of the best linear forecast, the point forecast $\hat{\rho} x_T$ is still valid under Case 3. The difference is that while $\hat{\rho}$ no longer consistently estimates $\rho$, it still captures the linear correlation between $x_{T}$ and $x_{T+1}$: $$ \hat{\rho} \stackrel{p}{\rightarrow} \frac{Cov(x_{t+1}, x_t)}{Var(x_t)} \, (\neq \rho). $$ The forecast interval $$ \hat{\rho} x_T \pm 1.96 \times \hat{\sigma}_{HAC} $$ would have the correct asymptotic coverage probability (with respect to the best linear forecast, not the best forecast) if $\hat{\sigma}^2_{HAC}$ is the heteroskedasticity and autocorrelation consistent (HAC) estimate of the long-run variance computed from the residuals.
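
Here is a sketch of that interval, reusing `x`, `y`, `z`, and `rho_hat` from the Case 3 simulation above; the Bartlett-kernel (Newey-West) long-run variance is computed by hand, and the truncation lag is an arbitrary illustrative choice:

```python
import numpy as np

# Reuses x, y, z, rho_hat from the Case 3 simulation above
resid = y - rho_hat * z

# Bartlett-kernel (Newey-West) estimate of the residual long-run variance;
# the truncation lag L is an arbitrary illustrative choice
L = 20
n = len(resid)
lrv = (resid @ resid) / n
for j in range(1, L + 1):
    gamma_j = (resid[j:] @ resid[:-j]) / n
    lrv += 2 * (1 - j / (L + 1)) * gamma_j

sigma_hac = np.sqrt(lrv)
forecast = rho_hat * x[-1]  # the best *linear* forecast
print(f"forecast = {forecast:.3f} +/- {1.96 * sigma_hac:.3f}")
```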

Michael
  • Why is the estimator of $\rho$ inconsistent in case 3? E.g. in a regression (not autoregression) with autocorrelated errors, the OLS estimator of the slope coefficient is consistent, though inefficient. What happens so that it becomes inconsistent in an autoregression? Also, in case 2 you may wish to note that the coverage is correct only unconditionally but not conditionally. Also, mentioning efficiency (including it at least briefly in the considerations) could be relevant, too. – Richard Hardy Jun 11 '20 at 05:48
  • "...in a regression (not autoregression) with autocorrelated errors, the OLS estimator of the slope coefficient is consistent..."---yes, in a regression where exogeneity of regressors holds. More precisely, consistency would hold if $\frac{1}{T}\sum_1^T E[x_t \epsilon_t] \rightarrow 0$ where $x_t$ is the regressors (this condition is weaker than exogeneity). In an AR regression with LDV, exogeneity does not hold and consistency does not follow. Simulation would tell you consistency does not hold in this case. – Michael Jun 11 '20 at 12:37
  • "in case 2,... the coverage is correct only unconditionally but not conditionally"---yes, true. This formulation of MSFE seems customary. One can also talk about conditioning on $X_T$, in which case the unconditional variance of $X_T$ does not enter. I have not seen the conditional formulation in too many places; would you have a reference? – Michael Jun 11 '20 at 12:41
  • Thank you for the clarifications! What is LDV? I am used to seeing this in the context of *limited dependent variables*. No, I do not have any concrete reference. – Richard Hardy Jun 11 '20 at 12:42
  • Lagged dependent variables. – Michael Jun 11 '20 at 12:42
  • I have never seen LDV used for *lagged dependent variables* before, perhaps since there is the term *autoregressive*. The lack of consistency caused by autocorrelation in AR models sounds like an important problem. I was not aware of it before. Glad to learn something new! – Richard Hardy Jun 11 '20 at 12:47
  • "efficiency...could be relevant"---I suppose efficiency is relevant in that estimation error plays a role but it seems negligible to me since the estimation error term in MSFE is of order 1/T. On the other hand, the long-run variance of the residuals does not vanish and is the dominant term. The robust (HAC) calculation is basically the same for residual long-run variance and for estimation error long run variance, though. – Michael Jun 11 '20 at 12:53
  • @RichardHardy A tangential point of some interest, regarding lack of consistency due to serial correlation: The same bias shows up in the non-stationary setting. Regression-based unit root test (Dickey-Fuller) loses power due to this bias. Couple standard ways to address this are: 1. Augmenting by lags of first differences (Augmented Dickey-Fuller, but the DGP is now really a triangular array, not a time series). 2. Applying HAC to residuals (Phillips-Perron). – Michael Jun 11 '20 at 16:41
  • @Michael, thank you so much. A natural follow-on question (and I am new to the platform so I am not sure if I should post this as a new question) is: what does the forecaster do when there is correlation in the residuals of an ARIMA model that is used to model the errors from a regression model? Does this mean the forecasting approach (regression model with ARIMA errors) is not suitable and cannot/shouldn't be used, or what steps can be taken to produce good point forecasts if the forecaster must necessarily use the regression-with-ARIMA-errors approach? – Newwone Jun 11 '20 at 20:02
  • @RichardHardy, thank you for your contributions – Newwone Jun 11 '20 at 20:02
  • I have added new questions to my comment to Michael. Not sure whether I should post as a new question, add it to the original question here, or just leave in the comment to Michael... Please pardon me, I am new to this platform. – Newwone Jun 11 '20 at 20:04
  • @Newwone As pointed out by RichardHardy, in a time series regression with no lagged dependent variables, as long as the regressors are exogenous, estimates are consistent. You would be fine using those estimates to compute forecasts, with HAC standard errors for the forecast error. (You should probably post this as a separate question, as the context is somewhat different.) – Michael Jun 11 '20 at 20:08
  • @Michael, Ok, I will post a new question. Just one last one on this one, sorry. Now I am confused again. In the example I pointed out in Rob Hyndman's book (section 10.3, figure 10.7), his linear regression has no LDVs, and the regressors are exogenous - does that mean his estimates are consistent and the points forecasts should still be ok as he said in the book? – Newwone Jun 11 '20 at 20:19
  • @Newwone Yes, provided those assumptions hold. Note that "no LDV" is a choice you make. In the quoted example, "personal consumption" (Personal Consumption Expenditure is the name of the data series recorded by the Federal Reserve) is typically highly persistent. Not including its lagged value is likely problematic. Similarly, exogeneity is something you assume and can presumably justify by empirical arguments. A missing LDV would violate exogeneity. – Michael Jun 11 '20 at 20:27

The short answer is that, usually, autocorrelation does not impact the estimates of the coefficients, but it does impact the variances. That's why he's saying that the point forecasts will not change but the confidence/prediction intervals will. Also, in time series regression the residuals are almost always correlated.

In other words, in a model $y_t=X_t\beta+\varepsilon_t$ where $\varepsilon_t$ is ARIMA, if you ignore the autocorrelation in $\varepsilon_t$, then your $\hat\beta$ are still OK, but their p-values and variances $\hat\sigma^2_\beta$ can be messed up. Hence the remark about the point forecast $\hat y_{t+h}=X_{t+h}\hat\beta$ being OK. He was careful to say "should still be ok," making it not an absolute statement but practical advice, with which I agree.
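
A minimal sketch of this setup in Python, assuming `statsmodels` is available (its `SARIMAX` class with an `exog` argument fits this regression-with-ARMA-errors form; the simulated data and coefficient values are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 300
X = rng.standard_normal((T, 1))  # exogenous regressor
# AR(1) errors: polynomial 1 - 0.6L, i.e. eps_t = 0.6 * eps_{t-1} + u_t
eps = sm.tsa.arma_generate_sample(ar=[1, -0.6], ma=[1], nsample=T,
                                  distrvs=rng.standard_normal)
y = 2.0 + 1.5 * X[:, 0] + eps  # true coefficients: 2.0 and 1.5

# Regression with AR(1) errors; trend="c" adds the intercept
res = sm.tsa.SARIMAX(y, exog=X, order=(1, 0, 0), trend="c").fit(disp=False)
print(res.params)  # intercept, beta, AR coefficient, error variance

# Point forecast for a new exogenous value
X_new = rng.standard_normal((1, 1))
print(res.get_forecast(steps=1, exog=X_new).summary_frame())
```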

A nonzero mean is difficult to detect because, by construction, the residuals of a regression have zero (unconditional) mean. A nonzero-mean error is an issue, of course, but it's more subtle than many think. Here's how it's expressed in conditional terms: $E[\varepsilon|X]=0$. One situation that violates this condition is when the errors' mean varies with the predictors. For instance, you overestimate for large values of the predicted $\hat y$ and underestimate for small values. This is why it's recommended to plot the residuals against the fitted values, as sketched below.
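
A sketch of that diagnostic plot, assuming `matplotlib` and reusing the fitted `res` from the sketch above:

```python
import matplotlib.pyplot as plt

# Residuals vs fitted values for the model above; a visible trend here
# would suggest E[eps | X] != 0
plt.scatter(res.fittedvalues, res.resid, s=8)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```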

Aksakal
  • "Usually, autocorrelation does not impact the estimates of coefficients"---clearly not true when lagged dependent variables present, which is almost always. The simplest such case is AR(1) with AR(1) errors. – Michael Jun 10 '20 at 21:24
  • The distinction between estimation and forecast is that, if one is concerned with the best *linear* forecast, then consistency of parameter estimates is not relevant. Computing the best forecast (conditional mean), on the other hand, requires consistent estimates. – Michael Jun 10 '20 at 21:28
  • @Michael, the OP is talking about regression with ARIMA errors, i.e. $y_t=X_t\beta+\varepsilon_t$ with $\varepsilon_t\sim ARIMA$. It's very distinct from an AR(1) model. – Aksakal Jun 11 '20 at 22:18