What is the proper way of calculating the auto-correlation of a finite-length, real-valued data series?

Searching books and the web one can find two different methods for calculating the auto-correlation function $f_{\text{ac}}$ of a given data series $\{x_i\}$ at lag $l$:

  1. $f_{\text{ac}}(l)= \frac{\big\langle(x_i-\mu)(x_{i+l}-\mu)\big\rangle}{\sigma^2}$, where $\mu$ is the mean, $\sigma^2$ the variance of the data and $\langle\bullet\rangle$ denotes the mean over the data series.
  2. Using all available pairs $(x_i, x_{i+l})$ as input for the (Pearson) correlation coefficient.
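
Written out explicitly, the two estimators might look as follows. This is a sketch in my own notation: $n$ is the length of the series, $f^{(1)}$ and $f^{(2)}$ label the two methods, and $\mu_a$, $\mu_b$ are the means of the two shortened series. For method 1, I assume the mean $\langle\bullet\rangle$ in the numerator runs over the $n-l$ available pairs.

$$f^{(1)}_{\text{ac}}(l)=\frac{\frac{1}{n-l}\sum_{i=1}^{n-l}(x_i-\mu)(x_{i+l}-\mu)}{\sigma^2},\qquad \mu=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad \sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$$

$$f^{(2)}_{\text{ac}}(l)=\frac{\sum_{i=1}^{n-l}(x_i-\mu_a)(x_{i+l}-\mu_b)}{\sqrt{\sum_{i=1}^{n-l}(x_i-\mu_a)^2\,\sum_{i=1}^{n-l}(x_{i+l}-\mu_b)^2}},\qquad \mu_a=\frac{1}{n-l}\sum_{i=1}^{n-l}x_i,\qquad \mu_b=\frac{1}{n-l}\sum_{i=1}^{n-l}x_{i+l}$$

In this form, method 1 centers and scales with statistics of the complete series, while method 2 re-estimates the means and standard deviations from the first and last $n-l$ points separately.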

I think both methods converge as the length of the 1-D input goes to infinity, but for shorter samples they give close but different results (more so for large lags than for small ones). I guess this is because of the different normalization (the variance of the complete series vs. the standard deviations of the 'shortened' series).

As a beginner student I'm a bit confused. Are these two methods totally different things, or is one the real one and the other a good-enough approximation?

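Example implementations of both methods in Python:
Method 1:

import numpy as np

def acf_brute(x, maxlag):
    # Center with the mean of the complete series.
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    def _foo(lag):
        # Sum of lagged products over the n - lag available pairs.
        return (x[:n-lag] * x[lag:]).sum()
    foo = np.vectorize(_foo)
    r = foo(np.arange(maxlag))
    # r / r[0] divides by the full-series sum of squares; n / (n - lag)
    # turns both sums into means (over n - lag pairs and n points).
    return r / r[0] * n / np.arange(n, n - maxlag, -1)

Method 2:

def acf_npcorrcoef(x, maxlag):
    # Pearson correlation of the pairs (x[i], x[i+lag]); each sub-series
    # is centered and scaled with its own mean and standard deviation.
    n = len(x)
    def _foo(lag):
        return np.corrcoef(x[:n-lag], x[lag:])[0, 1]
    foo = np.vectorize(_foo)
    return foo(np.arange(maxlag))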
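
To make the difference concrete, here is a minimal, self-contained sketch that evaluates both estimators on a tiny series that can be checked by hand, and on a longer synthetic one. The names `acf_full_norm`, `acf_pairwise`, `ar1` and `max_abs_gap`, the AR(1) test series and its parameters are illustrative assumptions, not part of the implementations above.

```python
import numpy as np

def acf_full_norm(x, maxlag):
    """Method 1: full-series mean and variance, mean over the n - lag pairs."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    var = (d * d).mean()
    return np.array([(d[:n - l] * d[l:]).mean() / var for l in range(maxlag)])

def acf_pairwise(x, maxlag):
    """Method 2: Pearson correlation coefficient of the pairs (x_i, x_{i+l})."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.corrcoef(x[:n - l], x[l:])[0, 1] for l in range(maxlag)])

def ar1(n, phi=0.8, seed=0):
    """Hypothetical test data: an AR(1) series with an arbitrarily chosen phi."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0]
    for i in range(1, n):
        x[i] = phi * x[i - 1] + e[i]
    return x

def max_abs_gap(n, maxlag=20):
    """Largest absolute difference between the two estimators up to maxlag."""
    x = ar1(n)
    return np.max(np.abs(acf_full_norm(x, maxlag) - acf_pairwise(x, maxlag)))

if __name__ == "__main__":
    ramp = [1.0, 2.0, 3.0, 4.0]
    # Method 1 on the ramp: 1, 1/3, -0.6 for lags 0, 1, 2.
    print(acf_full_norm(ramp, 3))
    # Method 2 on the ramp: 1, 1, 1, because every set of lagged pairs is exactly linear.
    print(acf_pairwise(ramp, 3))
    # The gap between the two estimators for a short and a long series.
    for n in (50, 20000):
        print(n, max_abs_gap(n))
```

The four-point ramp shows the effect of the normalization in its most extreme form: the pairs at every lag lie exactly on a line, so the correlation coefficient is 1, while the estimator based on the full-series mean and variance drops to 1/3 and then to -0.6.
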
  • Questions solely about how software works are [off-topic](http://stats.stackexchange.com/help/on-topic) here, but you may have a real statistical question buried here. You may want to edit your question to clarify the underlying statistical issue. You may find that when you understand the statistical concepts involved, the software-specific elements are self-evident or at least easy to get from the documentation. – gung - Reinstate Monica Dec 13 '16 at 17:04
  • How is this question solely about software? (E.g. the two methods can also be done by hand.) – suugakugasukidesu Dec 13 '16 at 17:07
  • That's what I mean. You may want to edit this to be about how this can be calculated (not how w/ *Python*). Or write out the procedures you have in mind in pseudocode or mathematically. – gung - Reinstate Monica Dec 13 '16 at 17:18
  • I suspect you might find http://stats.stackexchange.com/questions/81754/understanding-this-acf-output/81764#81764 relevant and useful. – whuber Dec 13 '16 at 17:50
  • @gung I see your point, and tried to reformulate the question with my limited knowledge of correct math formulation. – suugakugasukidesu Dec 13 '16 at 18:00
  • @whuber Thank you for the link and your explanation there! To sum it up: the correlation function (method 1) at lag l is an approximation of the correlation coefficient of the series with itself at lag l (method 2). Is that correct? Is there a reason, apart from computation time, why one does not always use the correlation coefficients? – suugakugasukidesu Dec 13 '16 at 19:36
