6

I read that OLS underestimates variance when residuals are autocorrelated. I see why autocorrelation would be a problem in time series analysis, in the sense that the coefficients are not efficient because we're not including all the potential predictors. But is there a mathematical problem as well?

For example, suppose we want to predict used-car sale margins. The data set includes each vehicle's make, model, miles per gallon, price, options, etc., and the final sale price. For some reason the catalog has been sorted by car make and year/model, so adjacent observations will likely have similar sale numbers. Is autocorrelation a problem in this case?

Robert Kubrick
  • It seems like the issue is heteroskedasticity, but if you're talking about panel data, then it might be both. People typically use Newey-West errors to deal with these issues. – John Aug 27 '14 at 15:18
  • I will offer a provisional "you are clear", although *provisional* because autocorrelation outside time-series or spatial series is a bit outside my ken. – Alexis Aug 27 '14 at 18:52
  • 3
Have you examined the OLS output after re-sorting your data? (If not, you might consider trying it: just permute the cases randomly or sort them on some other variable; see the sketch after these comments.) Only the things that change in the output will be of concern to you (insofar as the sorting goes) :-). – whuber Aug 27 '14 at 20:13
  • @whuber would you suggest something like binning/histograming and making statistics out of the distribution of autocorrelation measures from some large-ish number of permutations of the data order? Are there formal tests and inferences along those lines? – Alexis Aug 28 '14 at 05:01
  • @whuber Well, after sorting the data by car color, nothing changed in the R lm regression output. I gather that the residuals autocorrelation assumptions and/or requirements I keep reading about are wrong. It's only a matter of efficiency in time series analysis. – Robert Kubrick Aug 28 '14 at 12:16
  • 1
    How your data is sorted has nothing to do with regression output (this is what whuber was trying to show to you by asking you to rerun). And also nothing to do with autocorrelation or heteroskedasticity or clustered errors or any problems that may show up. – Affine Aug 28 '14 at 13:07
  • 1
    @Affine Of course it has to do with the autocorrelation of residuals. If you change the order of the data set (and thus the order of residuals) you will have different correlation values at lag $x$. – Robert Kubrick Aug 28 '14 at 14:42
  • @RobertKubrick it **is not** only a matter of efficiency in time series (see my answer). – Alexis Aug 28 '14 at 15:47
  • @Alexis We're diverging from the original question. Even so, I don't understand what you mean by "far worse than low efficiency". Of course excluding $Y_{t-1}$ from the model changes the conditional mean and sd. Of course $R^2$ is worse. You can extend that statement to any model that does not include a critical covariate. – Robert Kubrick Aug 28 '14 at 16:08
  • @RobertKubrick Integrated data **cannot** be estimated without bias, regardless of sample size, because $\mu$ is undefined and $\sigma^{2}=\infty$; a defined $\bar{x}$ and a finite $s^{2}$ must be biased. This is not a question of efficiency. – Alexis Aug 28 '14 at 16:14
  • @Alexis How does the variance of the population relate to the variance of the coefficients? Let's say we have a covariate $X_2$ with a 0.75 correlation with $Y_{t-1}$. We're not using $Y_{t-1}$ in our model predictors. Why would $X_2$ be biased? – Robert Kubrick Aug 28 '14 at 16:36
  • @RobertKubrick If there is autocorrelation/heteroscedasticity **inherent** in your dataset, it will show up in your regression **regardless** of how you sort your dataset. If there is autocorrelation between years, it doesn't matter if you sort by year or by color before running your regression. The regression results you get will be exactly the same, and both sets will be impacted by the autocorrelation. – Affine Aug 28 '14 at 16:49
  • @RobertKubrick In the simple OLS case of one variable the coefficient is estimated as the summed average product of $X$ and $Y$ deviations about their sample means divided by the summed squared deviation in $X$... these are biased estimates of the corresponding population quantities. – Alexis Aug 28 '14 at 17:08
  • @Alexis I think you're confusing biased with best. Last sentence of the second paragraph: http://en.wikipedia.org/wiki/Autocorrelation#Regression_analysis – Robert Kubrick Aug 28 '14 at 18:10
  • @Affine What do you mean by "show up"? Do you have any reference on this? – Robert Kubrick Aug 28 '14 at 19:07
  • @RobertKubrick I may well be confused. However, integrated (i.e. random walk) type autocorrelated data have some strange properties stemming from the undefined mean and infinite variance of the population. For example, one would expect two unrelated random walks to appear correlated (including in a regression context) regardless of sample size (hence my taking issue with efficiency in your point about time series). – Alexis Aug 28 '14 at 21:10
  • I'm not sure where the hang up on sorting is coming from. A silly example - let's take the S&P 500 index. This is an autocorrelated time series. Let's say you also measure something else X and regress the index against X. It doesn't matter if you've sorted your dataset by X or by time, autocorrelation is inherent. – Affine Aug 28 '14 at 23:33
  • What may be confusing you is that a common method of *diagnosing* autocorrelation in the time series domain is to plot residuals against the time ordering of your data. This is something you need to do yourself; R's `plot` on the `lm` object only plots residuals vs fitted (which does allow you to diagnose other types of heteroscedasticity, such as the infamous funnel). – Affine Aug 28 '14 at 23:38
  • @Affine I think you're confusing autocorrelation and unit root. The S&P 500 example you gave suffers from a unit root (inherently, as you say), but not from autocorrelation. (Calculate a random series' correlation against its lag 1, then add 1,000 to each value of the same series and recalculate the correlation. Does it change? Yet you have introduced a unit root.) – Robert Kubrick Aug 29 '14 at 14:37
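To make whuber's and Affine's point concrete, here is a minimal R sketch (with a made-up `cars` data frame and hypothetical variable names, not the asker's actual data) showing that the `lm` fit is invariant to row order, while anything computed *along* the row order, such as the lag-1 autocorrelation of the residual sequence, is not:

```r
set.seed(1)
n <- 200
cars <- data.frame(price   = runif(n, 5000, 30000),  # hypothetical covariates
                   mileage = runif(n, 10, 40))
cars$margin <- 0.1 * cars$price - 50 * cars$mileage + rnorm(n, sd = 500)

fit1 <- lm(margin ~ price + mileage, data = cars)
fit2 <- lm(margin ~ price + mileage, data = cars[sample(n), ])  # rows shuffled

coef(fit1) - coef(fit2)   # essentially zero: row order is irrelevant to the fit

# What the ordering *does* change is any statistic computed in row order,
# e.g. the lag-1 autocorrelation of the residuals:
acf(residuals(fit1), lag.max = 1, plot = FALSE)
acf(residuals(fit2), lag.max = 1, plot = FALSE)
```

If the residuals do show serial correlation under a *meaningful* ordering (e.g. time), the usual remedy John mentions above is HAC standard errors, e.g. `lmtest::coeftest(fit1, vcov. = sandwich::NeweyWest(fit1))`.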

2 Answers

4

Correlated residuals in time series analysis may imply something far worse than low efficiency: if the structure of autocorrelation implies integrated or near-integrated data, then any inferences about levels, means, variances, etc. may be spurious (with unknown direction of bias), because the population mean is undefined and the population variance is infinite (so, for example, the finite values $\bar{x}$ and $s_{x}$, and quantities derived from them, are always false estimates of the corresponding population quantities).

That's not a problem that can be resolved by increasing sample size to offset inefficiency.
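A quick simulation (my sketch, not part of the original argument) illustrates why more data doesn't help: regressing one random walk on a second, completely independent one routinely yields a "significant" slope, and lengthening the series tends to make the spurious t-statistic larger, not smaller.

```r
set.seed(42)
n <- 1000
x <- cumsum(rnorm(n))   # integrated series (random walk)
y <- cumsum(rnorm(n))   # an independent random walk

summary(lm(y ~ x))      # slope typically "significant" despite no true relation
```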

If autocorrelated errors arise in an OLS regression, the same issues may be present (it depends on the data generating process). Again: not an issue of efficiency.

The critical caveat is whether the ordering of your data is meaningful: if the order has meaning, in the sense that it relates to the data generating process, then you're in trouble.

Alexis
  • Data ordering is not critical, as I explained in the example. I'm still not clear why the estimates would be wrong in this case. – Robert Kubrick Aug 27 '14 at 16:40
1

1) The time series autocorrelation you refer to is the correlation between a series and a time-shifted copy of itself; "time" is observed when the data are collected. In your example, autocorrelation by shifting car make or model is not very meaningful. For new cars, shifting by year (comparing year-over-year sales of the same type of car) makes sense, but for used cars it would be less meaningful, since the random usage each car has been exposed to would erase any such correlation. I think you are fine going ahead with OLS.

2) You would be fitting an unbiased linear estimator, a special case of an M-estimator. If your objective is to build a predictive model (as opposed to testing hypotheses expressible in terms of model parameters), then OLS is appropriate. To guard against the possibility of unmet model assumptions, use a training sample to build your model and a validation sample to assess its performance on out-of-sample cases.
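A minimal R sketch of that train/validation workflow, again with made-up variable names echoing the question:

```r
set.seed(7)
n <- 500
cars <- data.frame(price   = runif(n, 5000, 30000),  # hypothetical covariates
                   mileage = runif(n, 10, 40))
cars$margin <- 0.1 * cars$price - 50 * cars$mileage + rnorm(n, sd = 500)

idx   <- sample(n, size = 0.7 * n)          # 70% of cases for training
train <- cars[idx, ]
valid <- cars[-idx, ]

fit  <- lm(margin ~ price + mileage, data = train)
rmse <- sqrt(mean((valid$margin - predict(fit, newdata = valid))^2))
rmse                                        # out-of-sample predictive error
```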

VictorZurkowski