What is the proper way of calculating the autocorrelation of a limited-length, real-valued data series?
Searching books and the web, one can find two different methods for calculating the autocorrelation function $f_{\text{ac}}$ of a given data series $\{x_i\}$ at lag $l$:
- $f_{\text{ac}}(l)= \frac{\big\langle(x_i-\mu)(x_{i+l}-\mu)\big\rangle}{\sigma^2}$, where $\mu$ is the mean, $\sigma^2$ the variance of the data and $\langle\bullet\rangle$ denotes the mean over the data series.
- Using all possible pairs $(x_i, x_{i+l})$ as input for the (Pearson) correlation coefficient, i.e. correlating the two shifted sub-series (written out explicitly below).
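To make the second method concrete, here is how I would write it down; the sub-series means $\bar{x}_{\text{head}}$, $\bar{x}_{\text{tail}}$, the series length $n$, and the 1-based indexing are my own notation:

$$
f_{\text{ac}}^{(2)}(l)=\frac{\sum_{i=1}^{n-l}\big(x_i-\bar{x}_{\text{head}}\big)\big(x_{i+l}-\bar{x}_{\text{tail}}\big)}{\sqrt{\sum_{i=1}^{n-l}\big(x_i-\bar{x}_{\text{head}}\big)^2}\,\sqrt{\sum_{i=1}^{n-l}\big(x_{i+l}-\bar{x}_{\text{tail}}\big)^2}},\qquad
\bar{x}_{\text{head}}=\frac{1}{n-l}\sum_{i=1}^{n-l}x_i,\quad
\bar{x}_{\text{tail}}=\frac{1}{n-l}\sum_{i=1}^{n-l}x_{i+l}.
$$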
I think both methods converge to the same result as the length of the series goes to infinity, but for finite samples they give close yet different values, and the discrepancy grows with the lag. I guess this is because of the different normalization: the first method uses the mean and variance of the complete series, while the second uses the means and standard deviations of the two 'shortened' sub-series (see the quick comparison after the code below).
As a beginner student I'm a bit confused: are these two methods fundamentally different things, or is one of them the 'real' autocorrelation and the other just a good-enough approximation?
Example implementations of both methods in Python:
Method 1:
```python
import numpy as np

def acf_brute(x, maxlag):
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()          # remove the global mean once

    def _foo(lag):
        # sum of the n - lag available products (x_i - mu)(x_{i+lag} - mu)
        return (x[:n - lag] * x[lag:]).sum()

    foo = np.vectorize(_foo)
    r = foo(np.arange(maxlag))
    # mean over the n - lag products, normalised by the lag-0 value (the variance)
    return r / r[0] * n / np.arange(n, n - maxlag, -1)
```
Method 2:
```python
def acf_npcorrcoef(x, maxlag):
    x = np.asarray(x, dtype=float)
    n = len(x)

    def _foo(lag):
        # Pearson correlation of the two overlapping slices; each slice is
        # centred and scaled with its own mean and standard deviation
        return np.corrcoef(x[:n - lag], x[lag:])[0, 1]

    foo = np.vectorize(_foo)
    return foo(np.arange(maxlag))
```
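To illustrate the discrepancy I mean, here is a minimal comparison sketch on synthetic data; the AR(1) coefficient 0.8, the series length 200 and `maxlag = 50` are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy AR(1) series: x_t = 0.8 * x_{t-1} + noise (parameters chosen arbitrarily)
n = 200
noise = rng.standard_normal(n)
x = np.empty(n)
x[0] = noise[0]
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + noise[t]

maxlag = 50
r1 = acf_brute(x, maxlag)
r2 = acf_npcorrcoef(x, maxlag)

# the two estimates agree closely at small lags and tend to drift apart at larger ones
print(np.abs(r1 - r2)[:5])    # differences at the smallest lags
print(np.abs(r1 - r2)[-5:])   # differences at the largest lags
```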