14

I am trying to figure out how R computes lag-k autocorrelation (apparently, it is the same formula used by Minitab and SAS), so that I can compare it to using Excel's CORREL function applied to the series and its k-lagged version. R and Excel (using CORREL) give slightly different autocorrelation values.

I'd also be interested to find out whether one computation is more correct than the other.

Galit Shmueli
  • 1,090
  • 8
  • 10
  • `R`'s formula is further analyzed and explained at http://stats.stackexchange.com/questions/81754/understanding-this-acf-output/81764#81764. – whuber Jan 14 '14 at 15:58

2 Answers2

18

The exact equation is given in: Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer-Verlag. I'll give you an example:

### simulate some data with AR(1) where rho = .75
xi <- 1:50
yi <- arima.sim(model=list(ar=.75), n=50)

### get residuals
res <- resid(lm(yi ~ xi))

### acf for lags 1 and 2
cor(res[1:49], res[2:50])      ### not quite how this is calculated by R
cor(res[1:48], res[3:50])      ### not quite how this is calculated by R

### how R calculates these
acf(res, lag.max=2, plot=F)

### how this is calculated by R
### note: mean(res) = 0 for this example, so technically not needed here
c0 <- 1/50 * sum( (res[1:50] - mean(res)) * (res[1:50] - mean(res)) ) 
c1 <- 1/50 * sum( (res[1:49] - mean(res)) * (res[2:50] - mean(res)) ) 
c2 <- 1/50 * sum( (res[1:48] - mean(res)) * (res[3:50] - mean(res)) ) 
c1/c0
c2/c0

And so on (e.g., res[1:47] and res[4:50] for lag 3).

Wolfgang
  • 15,542
  • 1
  • 47
  • 74
  • Thanks Wolfgang! This is exactly what I was looking for. Now I can try and replicate it in Excel (for my students who use Excel only). – Galit Shmueli May 20 '11 at 04:09
11

The naive way to calculate the auto correlation (and possibly what Excel uses) is to create 2 copies of the vector then remove the 1st n elements from the first copy and the last n elements from the second copy (where n is the lag that you are computing from). Then pass those 2 vectors to the function to calculate the correlation. This method is OK and will give a reasonable answer, but it ignores the fact that the 2 vectors being compared are really measures of the same thing.

The improved version (as shown by Wolfgang) is a similar function to the regular correlation, except that it uses the entire vector for computing the mean and variance.

Greg Snow
  • 46,563
  • 2
  • 90
  • 159
  • 3
    The other difference is the factor 1/n instead of 1/(n-k) where n is the length of the series and k the number of lags. This is to ensure that the autocorrelation matrix is positive definite. – Rob Hyndman May 19 '11 at 01:21
  • 1
    @Rob: I believe that Excel's CORREL formula does use n (rather than n-k). For example, for lag-1 ACF you get the same result (using Wolfgang's notation) if you use COVAR(res[1:49],res[2:50])/(STDEVP(res[1:49])*STDEVP(res[2:50])) and the functions COVAR and STDEVP are "population" statistics. – Galit Shmueli May 20 '11 at 04:14
  • @Galit. Even using COVAR and STDEVP, the denominator in your code would be 49, but the preferred definition of autocorrelation will use 50. – Rob Hyndman May 20 '11 at 06:02
  • 1
    The 1/n vs 1/(n-k) point is good for understanding in addition to the other points above. But for practical/observed numbers it is not going to matter as long as it is consistent, since that term shows up in both the numerator and the denominator it will cancel out. You could get a problem if different fractions were used in the numerator and denominator. – Greg Snow May 20 '11 at 15:58