I have two time series, for example:

a = c(2, 1, 2, 1, 2, 1, 2)
b = c(NA, NA, 1, 2, 1, 2, 1)
ccf(a, b, na.action=na.omit, plot=FALSE)

The output of ccf is:

Autocorrelations of series ‘X’, by lag

    -3     -2     -1      0      1      2      3 
 0.400 -0.567  0.800 -1.000  0.800 -0.567  0.400 

At lag 0 the ccf value is -1, which I understand. However, I can't figure out why the value is 0.8 at lag = -1 and -0.567 at lag = -2.

I've read "Why do I get different results using ccf() and cor() in R?", but that answer is based on acf and its data don't contain NAs.

How is the ccf calculated when the series contain NAs?

Specifically, what is the formula for lag = -2 in this toy example?

oszkar
Xi Wang

1 Answer

Because the simple correlation coefficient between the lagged series gives a biased estimate of the population correlation coefficient $\rho_{ij} \left( t \right)$, `ccf` uses a different estimator.

If you take a look at the built-in help (?ccf), it references the book Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S. Fourth Edition. Springer-Verlag. On page 390 you can find the estimation formula for the ccf:

$$c_{ij}\left( t \right) = \frac{1}{n} \sum_{s = \max \left( 1, -t \right)}^{\min\left( n - t, n \right)}{\left( X_i \left( s + t \right) - \overline{X_i} \right) \left( X_j\left( s \right) - \overline{X_j} \right)}, \qquad r_{ij}\left( t \right) = \frac{c_{ij}\left( t \right)}{\sqrt{c_{ii}\left( 0 \right) c_{jj}\left( 0 \right)}}$$

(Actually $r_{ij} \left( t \right)$ is not there, but it can be easily deduced from the acf formula $r_t = \frac{c_t}{c_0}$ given there: for the ccf, the variance $c_0$ in the denominator is replaced by $\sqrt{c_{ii} \left( 0 \right) c_{jj} \left( 0 \right)}$, the product of the two series' standard deviations, which is how `ccf` scales its result. For a and b in this question, which are perfectly negatively correlated with $r_{ij} \left( 0 \right) = -1$, this denominator equals $\left| c_{ij} \left( 0 \right) \right|$.)
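
To make the sum concrete for the lag asked about in the question, here is the arithmetic for $t = -2$, written out as a sketch. It assumes the two rows where b is NA have been dropped, leaving $a = (2, 1, 2, 1, 2)$ and $b = (1, 2, 1, 2, 1)$, so that $n = 5$, $\overline{a} = 1.6$ and $\overline{b} = 1.4$. Only the terms where both indices fall inside the series contribute, that is $s = 3, 4, 5$:

$$c_{ab}\left( -2 \right) = \frac{1}{5} \sum_{s = 3}^{5}{\left( a_{s - 2} - 1.6 \right) \left( b_s - 1.4 \right)} = \frac{1}{5} \left[ \left( 0.4 \right) \left( -0.4 \right) + \left( -0.6 \right) \left( 0.6 \right) + \left( 0.4 \right) \left( -0.4 \right) \right] = \frac{-0.68}{5} = -0.136$$

For these series $c_{aa} \left( 0 \right) = c_{bb} \left( 0 \right) = 0.24$ and $c_{ab} \left( 0 \right) = -0.24$, so the normalizing constant is $0.24$:

$$r_{ab}\left( -2 \right) = \frac{-0.136}{0.24} \approx -0.567$$

Note that the sum has only three terms but is still divided by $n = 5$, and the deviations are taken from the full-series means. That is why the value differs from the plain correlation of the overlapping parts, `cor(a[1:3], b[3:5])`, which is exactly -1.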

As

a <- c(2, 1, 2, 1, 2, 1, 2)
b <- c(NA, NA, 1, 2, 1, 2, 1)
ccf(a, b, na.action=na.omit, plot=FALSE)

is equivalent to

a <- c(2, 1, 2, 1, 2)
b <- c(1, 2, 1, 2, 1)
ccf(a, b, plot=FALSE)

with the result

Autocorrelations of series ‘X’, by lag

    -3     -2     -1      0      1      2      3 
 0.400 -0.567  0.800 -1.000  0.800 -0.567  0.400 

you can check the calculations by applying the above formulas 'manually' with the following R code:

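a <- c(2, 1, 2, 1, 2)
b <- c(1, 2, 1, 2, 1)
n <- length(a)
# Normalizer: sqrt(c_aa(0) * c_bb(0)). For these two series it equals
# |c_ab(0)| because they are perfectly negatively correlated; in general it does not.
denom <- sqrt(mean((a - mean(a))^2) * mean((b - mean(b))^2))
for (t in -3:3) {
  # Always divide by n and use the full-series means, whatever the overlap length.
  if (t <= 0) {
    # Negative lag: pair a[s + t] with b[s] for s = (1 - t), ..., n
    c_t <- 1 / n * sum((a[1:(n + t)] - mean(a)) * (b[(1 - t):n] - mean(b)))
  } else {
    # Positive lag: pair a[s + t] with b[s] for s = 1, ..., (n - t)
    c_t <- 1 / n * sum((a[(1 + t):n] - mean(a)) * (b[1:(n - t)] - mean(b)))
  }
  r_t <- c_t / denom
  print(r_t)
}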

with results

[1] 0.4
[1] -0.5666667
[1] 0.8
[1] -1
[1] 0.8
[1] -0.5666667
[1] 0.4
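
As a cross-check, the same calculation can be wrapped in a function and compared directly with what `ccf` returns for the original series with NAs. This is only a minimal sketch: `manual_ccf` is a hypothetical helper name (not part of R), it assumes two equal-length series with no missing values, and it scales by $\sqrt{c_{aa}(0) c_{bb}(0)}$, which coincides with $|c_{ab}(0)|$ for the series in this question.

```r
# Hypothetical helper (not part of R): cross-correlation at lags -lag_max..lag_max.
# Assumes x and y have the same length and contain no NAs.
manual_ccf <- function(x, y, lag_max) {
  n  <- length(x)
  dx <- x - mean(x)   # deviations from the full-series means
  dy <- y - mean(y)
  denom <- sqrt(mean(dx^2) * mean(dy^2))   # sqrt(c_xx(0) * c_yy(0))
  sapply(-lag_max:lag_max, function(t) {
    if (t <= 0) {
      c_t <- sum(dx[1:(n + t)] * dy[(1 - t):n]) / n
    } else {
      c_t <- sum(dx[(1 + t):n] * dy[1:(n - t)]) / n
    }
    c_t / denom
  })
}

a <- c(2, 1, 2, 1, 2, 1, 2)
b <- c(NA, NA, 1, 2, 1, 2, 1)
keep <- complete.cases(a, b)   # drop the rows where either series is NA

r_manual <- manual_ccf(a[keep], b[keep], lag_max = 3)
r_ccf <- as.vector(ccf(a, b, lag.max = 3, na.action = na.omit, plot = FALSE)$acf)

print(round(r_manual, 3))
print(all.equal(r_manual, r_ccf))   # compares the manual values with ccf's
```

Dropping the incomplete rows by hand with `complete.cases` mirrors what `na.omit` does inside `ccf` here, so the two vectors should agree up to floating-point tolerance.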
oszkar
  • Thank you for your answer. I still have two questions. (1) How is the ccf calculated when using `na.action = na.pass` in this toy example? According to your answer, na.omit ignores the first two values in a. When using `ccf(a, b, na.action=na.pass, plot=FALSE)`, ccf returns -0.99 (lag = 0) and 0.825 (lag = -1). (2) Why does the simple correlation coefficient between the lagged series give a biased estimate of the population correlation coefficient? Thank you in advance. – Xi Wang Apr 11 '20 at 07:57
  • I think these are separate questions, so it is not the best idea to discuss them in the comments, but some quick hints: (1) from `ccf` help using `na.pass` "*means that the estimate computed may well not be a valid autocorrelation sequence, and may contain missing values*". So I think it's better to omit `NA`s. To get more insight, I think you have to check the source code of `ccf`. (2) Check this: https://stats.stackexchange.com/questions/220961/is-the-sample-correlation-coefficient-an-unbiased-estimator-of-the-population-co – oszkar Apr 11 '20 at 09:04