What does the forecaster do when there is correlation in the residuals of an ARIMA model that is used to model the errors from a regression model? Does this mean the forecasting approach - regression model with ARIMA errors - is unsuitable and cannot/shouldn't be used, or what steps can be taken to produce good point forecasts if the forecaster must necessarily use the regression-with-ARIMA-errors approach?
2 Answers
There is a difference between forecasting into the future (predicting $y_{t+1}$ based on $y_t$) and contemporaneous prediction (predicting $y_t$ based on $x_t$).
As discussed in the linked question, forecasting into the future necessarily involves lagged dependent variables in the regression. In this case, serial correlation in the residuals indicates serial correlation in the error term. This would be problematic.
For contemporaneous prediction in a time series regression with no lagged dependent variables, valid predictions and prediction intervals can be computed under very general serial correlation and heteroskedasticity conditions on the error term, under the key exogeneity assumption.
Empirically, as long as the regressors are exogenous, the estimates are consistent and give consistent predicted values. Prediction errors can be computed by applying a HAC procedure to the residuals.
Take the simplest example, $$ y_t = \beta x_t + \epsilon_t. $$ As long as exogeneity holds, i.e. $E[x_t \epsilon_t] = 0$, or even under the weaker condition that it holds "in the long run" $$ \lim_{T \rightarrow \infty}\frac{1}{T} \sum_{t=1}^T E[x_t \epsilon_t] = 0 $$ the regression estimate $\hat{\beta}$ is consistent, and $\hat{\beta} x_t$ is a consistent predictor of $y_t$. In the context of prediction, exogeneity is customarily strengthened to $E[\epsilon_t|x_t] = 0$. So the best prediction is $E[y_t|x_t] = \beta x_t$.
The population prediction error would just be the long-run variance of $\epsilon_t$. The corresponding sample quantity can be computed by applying a HAC computation to the residuals.
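A minimal sketch of this computation in Python (assuming numpy and statsmodels; the data-generating process, the Bartlett bandwidth rule, and all names are illustrative choices, not a prescribed recipe):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Illustrative DGP: exogenous regressor, serially correlated AR(1) error
T = 500
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1.5 * x + eps

# OLS is consistent for beta under exogeneity, even with AR(1) errors
ols = sm.OLS(y, x).fit()
resid = ols.resid

# Newey-West (Bartlett kernel) estimate of the long-run variance of the
# residuals; the bandwidth rule below is a common textbook rule of thumb
L = int(4 * (T / 100) ** (2 / 9))
lrv = resid @ resid / T  # lag-0 autocovariance
for k in range(1, L + 1):
    gamma_k = resid[k:] @ resid[:-k] / T
    lrv += 2 * (1 - k / (L + 1)) * gamma_k

print(f"beta_hat = {ols.params[0]:.3f}, long-run variance of residuals = {lrv:.3f}")
```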
(One can plug in, assume, or forecast future values of $x_{T+2}$ and predict $y_{T+2}$, but this is a matter of empirical practice.)
Further Comments
Both the absence of lagged dependent variables and exogeneity are assumptions. They cannot be verified statistically; their validity rests on empirical justification.
Exogeneity $E[x_t \epsilon_t] = 0$ is by definition a statement about what is not observed ($\epsilon_t$), and therefore cannot be tested statistically. You have to justify empirically that everything you do not observe is uncorrelated with the regressor $x_t$. Serial correlation and heteroskedasticity in the residuals are harmless only if exogeneity holds.
For example, if $y_t$ depends on its lagged value $y_{t-1}$ but $y_{t-1}$ is omitted from the regression, then exogeneity would not hold. In this case, there would be serial correlation and heteroskedasticity in the residuals. Therefore, just like exogeneity, having no lagged dependent variables in the model is a choice. It implies you have made the assumption that $y_t$ does not depend on its lagged values, which then allows you to conclude that non-whiteness of the residuals is OK.
For example, suppose the true model is $$ y_t = \phi y_{t-1} + \beta x_t + \epsilon_t, $$ and you fit the model $$ y_t = \beta x_t + \epsilon_t. $$ If you mistakenly assumed exogeneity, you would conclude that the serial correlation you observe in the residuals is not due to the omitted lagged dependent variable (LDV), and mistakenly conclude that $\hat{\beta}$, and the corresponding predicted value, are consistent.
Data series from these models are observationally indistinguishable. Is the serial correlation in the residuals due to autoregression of the dependent variable or due to serial correlation in an exogenous error term? There's no statistical test that distinguishes the two cases.
Imposing a parametric ARMA structure on $(\epsilon_t)$ would not fix this problem.
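A small simulation sketch of this point (assuming statsmodels, whose SARIMAX with an exog argument fits a regression with ARMA errors; the parameter values are illustrative, and how far $\hat{\beta}$ drifts depends on them):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T, phi, beta = 2000, 0.7, 1.0

# True DGP: y_t = phi*y_{t-1} + beta*x_t + eps_t, with autocorrelated x_t,
# so x_t is correlated with the omitted regressor y_{t-1}
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + rng.normal()
    y[t] = phi * y[t - 1] + beta * x[t] + rng.normal()

# Mis-specified model: regression of y on x with AR(1) errors
# (regression with ARIMA errors, no lagged dependent variable).
# Its residuals can look white even though beta_hat is off.
wrong = sm.tsa.statespace.SARIMAX(y, exog=x, order=(1, 0, 0)).fit(disp=False)

# Correctly specified model includes the lagged dependent variable
right = sm.OLS(y[1:], np.column_stack([y[:-1], x[1:]])).fit()

print("beta_hat, ARMA-errors model:", round(wrong.params[0], 3))   # typically off 1.0
print("beta_hat, model with lagged y:", round(right.params[1], 3))  # near 1.0
```

The point is not the particular numbers but that whitening the error model does not repair the omitted-variable problem.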
(In the quoted example involving electricity demand and temperature, the model could well be correctly specified with no lagged dependent variables. I don't know nearly enough about the electricity market to say either way.)
Caveat
All this is relevant only if you care about the best prediction $E[y_t|x_t] = \beta x_t$. If you're only interested in the best linear prediction, go ahead: run the regression and use $\hat{\beta} x_t$. In this case, the bias in $\hat{\beta}$ is not a concern, since you don't really care about estimating the "true model". The OLS estimate, by construction, consistently estimates the best linear prediction coefficient $\frac{Cov(x_t, y_t)}{Var(x_t)}$.
In situations where you believe lagged variables play a role, they should certainly be included. Serial correlation in the residuals may suggest relevant lagged variables are being omitted, which leads to loss of predictive power.
Fine print in response to comments:
Best prediction of $y_t$ based on $x_t$ is $E[y_t | x_t]$. It is the function $f(x_t)$ of $x_t$ that minimizes $E[(f(x_t) - y_t)^2]$, informally.
Best linear prediction of $y_t$ based on $x_t$ is $\frac{Cov(x_t, y_t)}{Var(x_t)} x_t$. It is the linear function $f(x_t)$ of $x_t$ that minimizes $E[(f(x_t) - y_t)^2]$. By construction, the regression estimate $\hat{\beta}$ will "always" consistently estimate $\frac{Cov(x_t, y_t)}{Var(x_t)}$.
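A tiny numerical check of these definitions (illustrative Python; the cubic conditional mean is an arbitrary example of a nonlinear best prediction):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)

# Nonlinear truth: the best prediction E[y|x] = x + 0.5*x**3 is not linear
y = x + 0.5 * x**3 + rng.normal(size=x.size)

beta_ols = (x @ y) / (x @ x)                       # no-intercept OLS slope
beta_blp = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # Cov(x,y)/Var(x)

# Both estimate the best linear prediction coefficient (here 2.5, since
# Cov(x, x + 0.5*x^3) = 1 + 0.5*E[x^4] = 2.5 for standard normal x)
print(round(beta_ols, 3), round(beta_blp, 3))
```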

- Thank you. Couple of questions, please - Re 1) what are the practical ways to empirically justify that exogeneity holds? Re 2) how can one test/justify whether or not to include lagged dependent variables in the regression model? And one other question - I have never heard of best linear prediction. How can one know whether one needs the best linear prediction or the best prediction? Let me note that the context is about predicting into the future. – Newwone Jun 11 '20 at 22:37
- "...what are the practical ways to empirically justify that exogeneity..."---this is an empirical question. The short answer is that it depends on your knowledge of the empirical context (econometricians are much better trained in this regard than statisticians, so the introductory econometrics literature may be a good reference). – Michael Jun 11 '20 at 22:42
- "...how can one test/justify whether or not to include lagged dependent variables in the regression model..."---similar answer as the previous one, and you can also first fit univariate AR models for the dependent variable. E.g. does electricity demand this month depend on that of previous months? – Michael Jun 11 '20 at 22:43
- "I have never heard of best linear prediction. How can one know whether one needs the best linear prediction or the best prediction?"---most intro time series texts contain the two definitions. The difference may well be academic. In industry practice, go ahead and run a regression and do a simple pseudo out-of-sample forecast exercise. It might well suffice for your purpose. – Michael Jun 11 '20 at 22:48
- @Michael Do you have a good reference for this? Also, I'm interested in the prediction error of the sum of the n-step forecast. I can estimate the long-run HAC variance, but I also have to estimate the autocorrelation to work out the sum, and that doesn't seem to provide proper coverage? It's likely that the $E[x_t\varepsilon_t]=0$ requirement is violated though. – David Waterworth Jun 11 '20 at 22:51
- @DavidWaterworth A reference is Brockwell and Davis, Introduction to Time Series and Forecasting. – Michael Jun 11 '20 at 23:00
- @Michael, thank you. I will have a look at the book again. Just one question for now - are you talking about the regression model, or the ARIMA model used to model the errors of the regression model? And just to confirm I am understanding you correctly, is this correct: exogeneity is the requirement for estimates to be consistent and produce consistent predictions - in the presence of correlation in the residuals of an ARIMA model used to model the errors of a regression model. A missing LDV is just an example of a way to violate exogeneity. – Newwone Jun 11 '20 at 23:39
- And why would an ARMA not fix the issue if the ARMA is able to model the regression errors properly? For example, if the ARMA residuals are not correlated... – Newwone Jun 11 '20 at 23:39
- "And why would an ARMA not fix the issue if the ARMA is able to model the regression errors properly? For example, if the ARMA residuals are not correlated..."---good question. Because (in the example from the answer) you just fitted ARMA to (something like) $(\phi - \mbox{bias}) y_{t - 1} + \epsilon_t$ while mistakenly believing you fitted ARMA to $\epsilon_t$. The fit itself is fine: the residuals would be white. But your estimate $\hat{\beta}$ is still biased. – Michael Jun 11 '20 at 23:48
- "...are you talking about the [auto]regression model, or the ARIMA model used to model the errors of the regression model..."---the key point is that data from these models can appear very similar. Is the serial correlation in the residuals due to autoregression/LDV or due to serial correlation in an exogenous error term? You don't really know---there's no statistical test that tells you which one is the true model. They are observationally similar. Therefore you have to be able to justify the exogeneity assumption empirically. – Michael Jun 11 '20 at 23:54
- @Michael, thanks. So is the solution to the problem - if one is interested in the best prediction, not the best linear prediction - the following: 1) ensure exogenous regressors (to be honest, I'm not 100% sure what this means, particularly when we can't test/verify exogeneity) and 2) ensure lagged dependent variables are included in the regression model when appropriate? – Newwone Jun 11 '20 at 23:55
- @Michael, studying your recent comments very closely, it appears perhaps the initial challenge to solve is specifying a regression model that will generate an unbiased beta hat. – Newwone Jun 11 '20 at 23:59
- The additional information in beta hat is that it quantifies the causal impact of x on y. The best forecast captures this information, while the best linear forecast is only concerned with the correlation between x and y. – Michael Jun 12 '20 at 00:03
- @Michael, thank you very much for helping me with this. I will go back and study / research some more. Just one last thing - can you please explain/describe exogeneity in the context you have been using it in plain English, and then link how a missing LDV violates it, also in plain English? I think that might be the missing link for me to see the amazing picture you have been kind enough to give me. – Newwone Jun 12 '20 at 00:19
- @Newwone "...So is the solution to the problem - if one is interested in the best prediction..."---yes; notice your step 1) already entails step 2), which is checking for lags of the dependent variable. Exogeneity example---if electricity demand this month depends on the previous month in some way, then the model in the example quoted by you would be mis-specified and exogeneity does not hold. Why this should/should not be the case---you'd have to ask someone with expertise in that industry/market. – Michael Jun 12 '20 at 00:23
- @Michael, thanks! I'm very happy with your response. So we are back to exogeneity (i.e. "exogenous regressors"). Would love to see the description/explanation in plain English. – Newwone Jun 12 '20 at 00:26
- @Newwone Empirically speaking, exogeneity holds if everything you don't observe/include in the regression has no effect on the dependent variable. – Michael Jun 12 '20 at 00:41
- As an aside, there are several factors other than temperature which impact electricity demand - other climatic variables (wind speed, humidity, rain, solar) plus numerous behavioural factors (i.e. time of day and week). I would argue that none of the actual factors which impact demand are themselves influenced by past demand (i.e. past electricity demand doesn't change the weather or when we go to work) - although one could argue high electricity demand in the past might produce high prices which result in more efficiency measures - so perhaps lagged price should be added. – David Waterworth Jun 12 '20 at 02:37
- Regarding *For contemporaneous prediction in a time series regression with no lagged dependent variables, valid predictions and prediction intervals can be computed under <...> the key exogeneity assumption. Empirically, as long as the regressors are exogenous, the estimates are consistent and give consistent predicted values.* Endogeneity and exogeneity are causal rather than probabilistic terms. I wonder how problematic endogeneity is in prediction. What about the notion of [predictive consistency](https://stats.stackexchange.com/questions/265739)? Does it fail in a time series setting? (+1) – Richard Hardy Jun 12 '20 at 05:14
- Regarding *Both the absence of lagged dependent variables and exogeneity are assumptions. They cannot be verified statistically; their validity rests on empirical justification*, I have a feeling the terms *statistically* and *empirically* mean the same or something very similar. Instead of *empirically*, would it not be more logical to say *theoretically* or *by subject-matter considerations*? Also, I still think *AR term* would be more common than *LDV*; I have taken the liberty of spelling out the latter in your answer. – Richard Hardy Jun 12 '20 at 05:18
- I posted my first comment before reading the last paragraph. I think it gets close to addressing predictive consistency (there is a sentence about it), though I would love to see a slight elaboration in addition to what there already is. Regarding *All this is relevant only if you care about the best prediction $E[y_t|x_t] = \beta x_t$. If you're only interested in the best linear prediction, go ahead and run the regression.* In the first sentence, you define best prediction as being linear. In the second sentence, you seemingly contrast it to... best linear prediction. I am confused. – Richard Hardy Jun 12 '20 at 05:26
- The OP wasn't talking about using lagged Y, at least initially, and Hyndman's text also is not about this case. Regression with ARIMA errors is simply $y_t=X_t\beta+\varepsilon_t$, where $\varepsilon_t$ is an ARIMA process (no constant). Hence, although it is a valuable discussion, the lagged variable issues are irrelevant to the OP's question. – Aksakal Jun 12 '20 at 18:12
- @Aksakal The comment misses the point. You can't blindly assume the true data generating process is $y_t = \beta x_t + \epsilon_t$, and therefore that serial correlation in the residuals is due to serial correlation of an exogenous $\epsilon_t$. The serial correlation in the residuals could be due to lagged $y_t$ being omitted, in which case you have fitted the wrong model, resulting in an inconsistent estimate of the predicted value. These two models are observationally indistinguishable. Assuming blindly that LDVs play no role is a mistake (which one is free to make, admittedly). – Michael Jun 12 '20 at 18:22
- @Aksakal Even in the quoted example involving demand for electricity, the serial correlation in the residuals could well be due to lagged prices being omitted (omitted lagged regressors cause the same mis-specification problem). This particular type of model is actually econometrics 101. Every undergrad would know to include prices and their lags when estimating a demand function in the time series context. The comment by David Waterworth above already pointed this out. – Michael Jun 12 '20 at 18:28
- @Michael these are all minor issues compared to getting the predictors right, or getting predictors at all. Yes, it's cool to have the demand model regressed on temperature. Now get the temperature right. My point is that the issues you bring up are absolutely not important for the quality of the demand forecast. You can talk all day long about heteroscedasticity, yet if your temperature forecast is wrong your forecast will be wrong. I'm saying: deal with the real issues that impact your forecast meaningfully, i.e. the means. Get the mean in the ballpark at least, then worry about nuisance parameters. – Aksakal Jun 12 '20 at 18:35
- @Aksakal If your beta hat is inconsistent, you're not "getting the predictors right". If you're omitting relevant predictors (e.g. lagged values), you're not "getting the predictors right". – Michael Jun 12 '20 at 18:38
- @Michael, when you refer to including lagged values - did you mean a regression model with lagged values, with an ARIMA model for the errors of the regression model, or did you mean just a regression model with an LDV? Does that mean there are 3 potential modelling approaches to consider: 1) regression model (with no lagged regressors), with an ARIMA model for the regression model errors, 2) regression model (with lagged regressors), with an ARIMA model for the regression model errors, 3) regression model (with lagged regressors)? How does one know which one is best for the task at hand? – Newwone Jun 12 '20 at 18:42
- Is it a matter of implementing all 3 approaches and seeing which produces the best out-of-sample predictions? – Newwone Jun 12 '20 at 18:42
- @Michael, it's not just omitting variables, it's having the right model specification. In models of this kind, pretending that it is possible to *not* omit a variable is laughable. Even if you didn't omit the variable, how do you know it's in the right form and lag etc.? It's impossible. All econometric models omit a ton of variables and also put them in wrong forms. So what? You have to get the big ones in and hope that the forecasts will not be total garbage. If you can build a *correct* electricity demand model, sell it to any utility company. They'll buy it. – Aksakal Jun 12 '20 at 18:43
- @Newwone The issue caused by omitted lagged values is exactly the same, regardless of whether you have other regressors $x_t$ in the model. – Michael Jun 12 '20 at 18:43
- @Aksakal, thank you for your comments. Let's assume that the forecaster has good domain knowledge and has a pretty good idea of what the non-lagged regressors should be. – Newwone Jun 12 '20 at 18:44
- @Newwone, I think I have pretty good domain knowledge in some areas, yet I'm never sure whether I got the regressors right. I always doubt, and monitor my models' performance, and keep tweaking them. I know that it would be nice to have a neat theoretical discussion, but the practice is such that having the predictor set right is the main concern, and it is never truly resolved. – Aksakal Jun 12 '20 at 18:46
- I should also note that we already have all the (future) values of the regressors. The only variable that needs actual forecasting is the dependent variable. – Newwone Jun 12 '20 at 18:49
- @Aksakal, may I ask - what are the modelling / quantitative criteria you use to determine that you have a good forecasting model? – Newwone Jun 12 '20 at 19:08
- @Michael, would be interesting to know yours as well. – Newwone Jun 12 '20 at 19:12
- @RichardHardy, what are the modelling / quantitative criteria you use to determine that you have a good forecasting model? – Newwone Jun 12 '20 at 19:13
- Perhaps I should post this as a new question? Looks like there's no clear-cut answer and it would be interesting, I think, to see how people do it. – Newwone Jun 12 '20 at 19:13
- I posted it as a new question here: https://stats.stackexchange.com/questions/471858/how-do-you-determine-you-have-a-good-timeseries-forecasting-model – Newwone Jun 12 '20 at 19:38
- @Michael, sorry for repeating myself, but I believe my question about predictive consistency could shed light also on your discussion with Aksakal. What if $\hat\beta$ is predictively consistent? Or is it impossible in the present context? I would also appreciate your take on my other questions in the comments above. – Richard Hardy Jun 12 '20 at 19:57
- @RichardHardy "What if β^ is predictively consistent"---I don't know what you mean by "predictively consistent". β^ either consistently estimates β, or not. Here one has the situation where, purely statistically, one can't know. From a practical perspective, say one is willing/able to disregard consistency issues, predictors that are "obvious" should still be included. Besides other symptoms, serial correlation in the residuals is one possible indication that lagged values that should be included in the regression are being omitted. Another indication could come from, e.g., univariate models. – Michael Jun 12 '20 at 21:56
- @RichardHardy "In the first sentence, you define best prediction as being linear..."---the first sentence does not do that. The conditional mean is not the best linear prediction; it is better than that. Its mean square prediction error is less than, or equal to, that of the best linear prediction. It's also not linear, in general. – Michael Jun 12 '20 at 22:59
- @Michael, I am reiterating my question about [predictive consistency](https://stats.stackexchange.com/questions/265739) (the same link as above). It is a question about how you define $\beta$. In prediction, you may care about the "predictive" $\beta$, not the "causal" one. Regarding *In the first sentence, you define...*, you write a formula that shows the best prediction (the conditional mean) is linear: $E(y|x)=\beta x$, which leaves me confused as to how this is in contrast to a linear predictor. Generally it could be, but clearly in this case it is not, as you spell it out in the formula. – Richard Hardy Jun 13 '20 at 07:22
- @RichardHardy "...you write a formula that shows the best prediction (the conditional mean) is linear: E(y|x)=βx..."---yes, if β^ is consistent, then the predicted value β^x would be a consistent estimate, conditional on x, of the best prediction. If β^ is not consistent, e.g. when relevant predictors are omitted, then β^ has a probability limit something like (β + bias) = cov(x,y)/var(x), and β^x still consistently estimates the best linear prediction. – Michael Jun 13 '20 at 08:05
- @Michael, OK, that is in line with my own understanding of the problem. But I am beginning to drown in the details, and the lack of $\text{do}()$ or counterfactual notation to distinguish causal parameters from probabilistic parameters in your answer is not helping. The bigger point is, I wonder if the whole discussion of the structural causal model (a term used by Judea Pearl in his causal language) vs. the probabilistic model is helpful in the context of the OP's question. If $\hat\beta$ is predictively consistent regardless of exogeneity, why are you focusing on the causal $\beta$? – Richard Hardy Jun 13 '20 at 08:33
- @RichardHardy The MSPE in the two cases is not the same. There's a reason for the two different notions---best prediction and best linear prediction. It just so happens that, in this context, the "true beta" enters into the expression for the best prediction, which gives the smallest MSE. This tells you the better you capture beta (or, in practice, the less you miss it), the better the prediction. That beta also captures causal impact is only incidental (nothing is mentioned about causal impact in the answer). – Michael Jun 13 '20 at 08:53
- @RichardHardy, there's a post (and I don't have the link to it at hand, but I think I might be able to find it if you need me to) where you mentioned something like a good forecasting model should not have (auto)correlation in the residuals. Do you still have this view, and why? – Newwone Jun 13 '20 at 16:50
- @Newwone, I largely agree. If there is autocorrelation and (a crucial condition!) we are able to estimate it with sufficient precision, then we can improve our forecasts by accounting for the autocorrelation in the model. If we cannot estimate it with sufficient precision, then the model cannot be improved in this respect and autocorrelation becomes less of a concern. – Richard Hardy Jun 13 '20 at 19:34
- @Newwone, the true goodness of a predictive model is a subjective matter. There are somewhat objective criteria such as CV performance, in-sample goodness of fit, etc. However, in some domains, like mine, it is also important that a model makes sense to users. It may sound silly, but if I can't explain the model to users, they won't trust it, then won't use it. It somehow should correlate with their experience. Also it should reconcile with other models, etc. There are many aspects of modeling that are not scientific or even quantitative. That's why it's best here to stick to narrow subjects and focus. – Aksakal Jun 16 '20 at 14:39
In the practice of forecasting there is very little that is absolute. This is one such case, where there is no prescribed course of action. Presumably you started with a time series regression model $y_t=X_t\beta+\varepsilon_t$ where $\varepsilon_t\sim\mathcal N(0,\sigma^2)$.
Once you looked at the residuals $\hat\varepsilon_t$ and noticed that they're autocorrelated, you decided to improve the model and apply a regARIMA model: $$y_t=X_t\beta+\varepsilon_t$$ where $\varepsilon_t=\phi_1\varepsilon_{t-1}+u_t$ with $u_t\sim\mathcal N(0,\sigma^2_u)$.
Then you find that the residuals $\hat u_t$ are autocorrelated. Now what? You could try to fit a higher order ARIMA(p,d,q) instead of the first attempt with AR(1). In fact, if you pick high enough orders p, d, q, I bet that at some point the residuals $\hat u_t$ will start looking like white noise, as in the sketch below. Should you do this? Maybe, maybe not. It's up to you.
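A minimal sketch of that order-stepping loop (assuming statsmodels, whose SARIMAX with an exog argument fits a regression with ARIMA errors, and a recent version where acorr_ljungbox returns a DataFrame; the simulated data and candidate orders are purely illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(3)
T = 400
X = rng.normal(size=(T, 1))

# Illustrative regARIMA data with an AR(2) error, so that a first
# attempt with AR(1) errors underfits and leaves autocorrelation
eps = np.zeros(T)
for t in range(2, T):
    eps[t] = 0.5 * eps[t - 1] + 0.3 * eps[t - 2] + rng.normal()
y = X[:, 0] + eps

# Step up the error order until the Ljung-Box test stops rejecting
for order in [(1, 0, 0), (2, 0, 0), (3, 0, 1)]:
    fit = sm.tsa.statespace.SARIMAX(y, exog=X, order=order).fit(disp=False)
    pval = float(acorr_ljungbox(fit.resid, lags=[12])["lb_pvalue"].iloc[0])
    print(order, "Ljung-Box p-value:", round(pval, 3))
    if pval > 0.05:  # residuals look like white noise; stop here
        break
```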
I prefer parsimonious models, and dislike high order models, especially when it comes to the differencing order d. You also need to be careful with autocorrelation measures, since they're sensitive to outliers. For instance, you may have two big events 6 months apart, and if the dataset is not large, they'll look like 6-month seasonality.
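A toy sketch of that outlier effect (the spike size and spacing are arbitrary; statsmodels' acf is assumed):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(4)
e = rng.normal(size=60)  # short white-noise series
e[20] += 8.0             # two big events...
e[26] += 8.0             # ...six periods apart

# The shared outliers inflate the lag-6 autocorrelation, mimicking a
# 6-period "seasonality" that is not really in the process
print(np.round(acf(e, nlags=8), 2))
```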

- Thank you. The dataset is large. Question - if you stop at a parsimonious model but with autocorrelated residuals, will there not be concern around the accuracy of the point forecasts? – Newwone Jun 11 '20 at 22:43
- I think as long as you are not using lags of Y as a predictor, autocorrelated residuals will not reflect bias in your point forecast (which is effectively the same as saying they are accurate). I think that is similar to a point already made. – user54285 Jun 11 '20 at 23:33
- @user54285, unless I have misread Michael's comments, my understanding is that if the model should include Y lags but they have been omitted, exogeneity is violated and hence your point forecasts will not be correct - if the model residuals have correlation. – Newwone Jun 11 '20 at 23:49
- There's usually a bigger fish to fry than autocorrelation in the residuals of residuals. – Aksakal Jun 12 '20 at 01:16
- I think a critical point, easy to miss, is that if you do not have lags of Y as predictors, there will be no bias even if autocorrelation occurs. It will cause problems with your tests but not bias the point estimate. But if you do have the wrong lags of Y in your model, then you may encounter bias. But that is no different than any predictor being misspecified. Autocorrelation can, I think, be the sign of different problems, some of which involve bias and some which do not. – user54285 Jun 12 '20 at 18:08
- @user54285, I wonder if one can apply the notion of [predictive consistency](https://stats.stackexchange.com/questions/265739) and not worry about the bias you are warning us about. – Richard Hardy Jun 12 '20 at 20:04
- @Richard Hardy In honesty, that is beyond my understanding of statistics, in particular how likely it is for predictive consistency to exist in reality. I think bias tied to model mis-specification, as would seem to be occurring here, would always be a concern. – user54285 Jun 12 '20 at 22:02
- @user54285, the notion of predictive consistency is actually surprisingly powerful, so my guess would be in the positive; see the blog post by Francis X. Diebold that I refer to in the linked thread, he is pretty optimistic about it. Michael has also added his answer in that thread, quite a positive one. – Richard Hardy Jun 16 '20 at 14:33
- @Richard Hardy thanks. This is an entirely new concept to me. I spend a lot of my days worrying about bias, so it's a very important one. – user54285 Jun 16 '20 at 21:50
- @user54285, bias is easy to deal with: just adjust for it; there are many ways, like intercept correction, etc. Worry about shifts of the mean. – Aksakal Jun 16 '20 at 22:22
- @Aksakal I have read many treatments of bias over the years and none suggested a method for corrections. Do you know a source I can go to for those? When you say shift of the mean, do you mean a structural break, when the relationship between X and Y changes, or simply :) when the series mean changes? – user54285 Jun 16 '20 at 23:35