10

I'm trying to understand the "mechanism" behind the calculation of ACF values in a time series. As a "prove-it-to-myself exercise" [NOTE: I updated the code in this link to reflect the information in the accepted answer], I am not focusing on the elegance of the code, and I use loops on purpose. The problem is that, although the values I get are close to those generated by acf(), they are not equal, and at some points, not all that close. Is there any conceptual misunderstanding?

As credited on the online notes linked above, the data was generated following this online example.

set.seed(0)                          # To reproduce results
x = seq(pi, 10 * pi, 0.1)            # Creating time-series as sin function.
y = 0.1 * x + sin(x) + rnorm(x)      # Time-series sinusoidal points with noise.
y = ts(y, start = 1800)              # Labeling time axis.
model = lm(y ~ I(1801:2083))         # Detrending (0.1 * x)
st.y = y - predict(model)            # Final de-trended ts (st.y)
ACF = 0                              # Empty vector to capture the auto-correlations.
ACF[1] = cor(st.y, st.y)             # The first entry in the ACF is the cor with itself.
for(i in 1:24){                      # Took 24 points to parallel the output of `acf()`
  lag = st.y[-c(1:i)]                # Introducing lags in the stationary ts.
  clipped.y = st.y[1:length(lag)]    # Compensating by reducing length of ts.
  ACF[i + 1] = cor(clipped.y, lag)   # Storing each correlation.
}
w = acf(st.y)                        # Extracting values of acf without plotting.
all.equal(ACF, as.vector(w$acf))     # Checking for equality in values.
# Pretty close to manual calc: Mean relative difference: 0.03611463"

To get a tangible sense of the relative outputs, this is what the autocorrelations look like calculated manually:

1.0000000 0.3195564 0.3345448 0.2877745 0.2783382 0.2949996 ... 
... -0.1262182 -0.1795683 -0.1941921 -0.1352814 -0.2153883 -0.3423663

as opposed to the R acf() function:

1.0000000 0.3187104 0.3329545 0.2857004 0.2745302 0.2907426 ...
... -0.1144625 -0.1621018 -0.1737770 -0.1203673 -0.1924761 -0.3069342

As suggested, this is likely explained comparing to the code in the call within acf() in the line:

acf <- .Call(C_acf, x, lag.max, type == "correlation")

How can this C_acf function be accessed?


I am having similar issues with the PACF, which I presume are related. The code for the PACF values is here. [NOTE: In this case I suspect it is actually a rounding difference].

Antoni Parellada
  • 23,430
  • 15
  • 100
  • 197
  • Have you taken a look at the C code that R uses to compute the ACF? – Jon Jan 02 '17 at 20:47
  • @Jon I looked at the R function, and saw that ultimately it refers the data to the `C_acf` procedure (?), and looked it up online without success. I am completely unfamiliar with C. – Antoni Parellada Jan 02 '17 at 20:50
  • FYI http://stackoverflow.com/q/14035506/1864816 – Scortchi - Reinstate Monica Jan 02 '17 at 22:50
  • 1
    I downloaded the most recent R source code tarball (R-3.3.2.tar.gz) and searched for "C_acf" but found nothing. I found the acf.R file, which uses C_acf, but not the actual C file. I did however find the pacf C file. This is quite strange. – Jon Jan 03 '17 at 00:01
  • @Jon Thanks for looking into it. It seems as though someone has to know the answer; it never helps when a question entails sifting through a chunk of code, like in my OP. It's possible that a more straight to the point question, possibly in SO, could do the trick. – Antoni Parellada Jan 03 '17 at 00:39
  • The problem at this stage is finding the C file. Your question involves understanding how R's algorithm works as your calculations vary significantly. I don't see how posting this question on SO would help. – Jon Jan 03 '17 at 00:43
  • @Jon Yes, asked as a software-specific question, such as, "how to find the C file routine", it would arguably be off-topic in CV. But it's a moot point. – Antoni Parellada Jan 03 '17 at 00:46
  • Antoni, have you seen http://stats.stackexchange.com/questions/81754? – whuber Jan 03 '17 at 17:45
  • @whuber +1 and thank you. No, I hadn't. I look forward to reading carefully your post for a deeper understanding of autocorrelation across time. – Antoni Parellada Jan 03 '17 at 17:48

1 Answers1

7

Apparently cor and acf are not using the same divisor. acf uses as divisor the number of observations and can be reproduced as shown below. For details, you can locate the code of C_acf in the file R-3.3.2/src/library/stats/src/filter.c (procedures acf and acf0):

set.seed(0)                          # To reproduce results
x = seq(pi, 10 * pi, 0.1)            # Creating time-series as sin function.
y = 0.1 * x + sin(x) + rnorm(x)      # Time-series sinusoidal points with noise.
y = ts(y, start = 1800)              # Labeling time axis.
model = lm(y ~ I(1801:2083))         # Detrending (0.1 * x)
st.y = y - predict(model)  
lag.max <- 24
ydm <- st.y - mean(st.y)
n <- length(ydm)
ACF <- rep(NA, lag.max+1)
for (tau in seq(0, lag.max))
  ACF[tau+1] <- sum(ydm * lag(ydm, -tau)) / n
ACF <- ACF/ACF[1]
names(ACF) <- paste0("lag", seq.int(0, lag.max))
ACF
head(ACF)
# lag0      lag1      lag2      lag3      lag4      lag5 
# 1.0000000 0.3187104 0.3329545 0.2857004 0.2745302 0.2907426 
tail(ACF)
# lag19      lag20      lag21      lag22      lag23      lag24 
# -0.1144625 -0.1621018 -0.1737770 -0.1203673 -0.1924761 -0.3069342 
all.equal(ACF, acf(st.y, lag.max=lag.max, plot=FALSE)$acf[,,1], check.names=FALSE)
# [1] TRUE
javlacalle
  • 11,184
  • 27
  • 53
  • Thank you. I think you answer the question, and plan on accepting the answer after going over the code with a debugger. I searched online for the file you post, and found it on a fossies.org page. The code there looks rather different - Did you adapt it to R yourself? Is there any other place I should be searching for it? In general, what is the procedure to find these C or Fortran functions? – Antoni Parellada Jan 03 '17 at 13:10
  • I didn't follow the C source code. I simply tried to do a a concise and illuminating implementation for illustration of the operations involved. In practice, it is better to use `acf`, which will for example deal with missing observations. – javlacalle Jan 03 '17 at 13:30
  • Very nice! Can you please add some context to the file you quote? I'll accept your answer anyway, but it would be nice to have a bit of an explanation as to how to resolve these situations where R points to some function in C or Fortran. – Antoni Parellada Jan 03 '17 at 13:33
  • The C or Fortran code is located in the directory `src` of the source packages. The package a function belongs to can be found in the header of the documentation (left-hand-side in brackets). In order to find the function in the directory `src`, it is helpful to use some searching tool such as [grep](https://en.wikipedia.org/wiki/Grep). – javlacalle Jan 03 '17 at 13:33
  • Do you access `src` from within RStudio? I use Windows, and was became acquainted with `grep` yesterday reading about it on SO, but I couldn't figure it out... – Antoni Parellada Jan 03 '17 at 13:35
  • I access `src` from the sources available on [CRAN](https://cran.r-project.org/sources.html), file R-3.3.2.tar.gz. Note that the sources of the base packages are on a directory named `src` as well. For other packages not distributed in the above file you can get the sources from its CRAN page, which has the general form https://cran.r-project.org/package=PKG_NAME. – javlacalle Jan 03 '17 at 13:39