
I have observed that, on average, the absolute value of the Pearson correlation coefficient is a constant close to 0.42 for any pair of independent random walks, regardless of the walk length.

Can someone explain this phenomenon?

I expected the correlations to get smaller as the walk length increases, like with any random sequence.

For my experiments I used random Gaussian walks with step mean 0 and step standard deviation 1.

UPDATE:

I forgot to center the data; that's why I got 0.56 instead of 0.42.

Here is the Python script to compute the correlations:

import numpy as np
from itertools import combinations, accumulate
import random

def compute(length, count, seed, center=True):
    """Mean absolute pairwise correlation among `count` independent walks of `length` steps."""
    random.seed(seed)
    basis = []
    for _i in range(count):
        # A random walk is the cumulative sum of iid N(0, 1) steps.
        walk = np.array(list(accumulate(random.gauss(0, 1) for _j in range(length))))
        if center:
            walk -= np.mean(walk)
        # Normalize to unit length: the dot product of two centered,
        # unit-length walks is exactly their Pearson correlation.
        basis.append(walk / np.sqrt(np.dot(walk, walk)))
    return np.mean([abs(np.dot(x, y)) for x, y in combinations(basis, 2)])

print(compute(10000, 1000, 123))
Adam
  • My first thought is that as the walk gets longer it is possible to get values with a larger magnitude, and the correlation is picking up on that. – John Paul Jan 10 '17 at 20:11
  • But this would work with any random sequence, if I understand you right, yet only the random walks have that constant correlation. – Adam Jan 10 '17 at 20:21
  • 4
  • This is not just any "random sequence": the correlations are extremely high, because each term is just one step away from the preceding one. Note, too, that the correlation coefficient you are computing is not that of the random variables involved: it's a correlation coefficient for the sequences (thought of simply as paired data), which amounts to a big formula involving various squares and differences of all the terms in the sequence. – whuber Jan 10 '17 at 20:50
  • 10
  • Are you talking about correlations *between* random walks (across series, not within one series)? If so, it's because your independent random walks are integrated but not cointegrated, which is a well-known situation where spurious correlations will appear. – Chris Haug Jan 10 '17 at 21:15
  • What are you doing to generate the `.56` number? I don't see why the absolute value of the sample Pearson correlation coefficient should converge at all. (And in my testing, it doesn't.) – Matthew Gunn Jan 10 '17 at 22:12
  • 9
  • If you take a first difference, you will find no correlation. The lack of stationarity is the key here. – Paul Jan 10 '17 at 22:16
  • Most probably, you are using pseudo-random numbers generated by an algorithm and started with a seed. Is that seed always the same or does it change? And you may have to specify further what you mean by the correlation: Do you collect all possible points where a random walk has been or do you have a lot of random walks and collect their endpoints? – Mayou36 Jan 10 '17 at 23:01
  • If OP was re-seeding with the same seed, the series would be identical, not just correlated. – Lagerbaer Jan 11 '17 at 00:59

2 Answers


Your independent processes are not correlated! If $X_t$ and $Y_t$ are independent random walks:

  • A correlation coefficient unconditional on time does not exist. (Don't talk about $\operatorname{Corr}(X, Y)$.)
  • For any time $t$, $\operatorname{Corr}(X_t, Y_t)$ is indeed 0.
  • But sample statistics based upon time-series averages will not converge to anything! The sample correlation coefficient you calculated based upon averaging multiple observations over time is meaningless.

Intuitively, you might guess (incorrectly) that:

  1. Independence between two processes $\{X_t\}$ and $\{Y_t\}$ implies they have zero correlation. (For two random walks, $\operatorname{Corr}(X, Y)$ doesn't exist.)
  2. The time-series sample correlation $\hat{\rho}_{XY}$ (i.e. the correlation coefficient calculated using time-series sample statistics such as $\hat{\mu}_X = \frac{1}{T} \sum_{\tau = 1}^T X_\tau$) will converge on the population correlation coefficient $\rho_{XY}$ as $T \rightarrow \infty$.

The problem is that neither of these statements is true for random walks! (They are true for better-behaved processes.)

For non-stationary processes:

  • You can talk about the correlation between processes $\{X_t\}$ and $\{Y_t\}$ at any two particular points of time (eg. $\operatorname{Corr}(X_2, Y_3)$ is a perfectly sensible statement.)
  • But it doesn't make sense to talk about correlation between the two series unconditional on time! $\operatorname{Corr}(X, Y)$ does not have a well-defined meaning.

The problems in the case of a random walk?

  1. For a random walk, unconditional population moments (i.e. which don't depend on time $t$), such as $\operatorname{E}[X]$, don't exist. (In some loose sense, they are infinite.) Similarly, the unconditional correlation coefficient $\rho_{XY}$ between two independent random walks isn't zero; it in fact doesn't exist!
  2. The assumptions of ergodic theorems don't apply and various time-series averages (eg. $\frac{1}{T} \sum_\tau X_\tau$) don't converge towards anything as $T \rightarrow \infty$.
    • For a stationary sequence, the time-series average will eventually converge on the mean that's unconditional on time. But for a non-stationary sequence, there is no mean that's unconditional on time! (See the sketch just below.)
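As a quick illustration (a minimal Python sketch in the style of the question's script; the seed and lengths are arbitrary choices), compare the time average of the stationary increments, which settles near $0$, with the time average of the walk itself, which keeps drifting:

import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
steps = rng.standard_normal(10**6)  # iid N(0, 1) increments: stationary
walk = np.cumsum(steps)             # the random walk itself: non-stationary

for t in (10**3, 10**4, 10**5, 10**6):
    # The LLN applies to the increments, so their time average converges;
    # no such theorem applies to the walk, and its time average drifts.
    print(f"T={t:>7}  mean(steps)={steps[:t].mean():+.4f}  mean(walk)={walk[:t].mean():+9.2f}")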

If you have various observations of two independent random walks over time (eg. $X_1$, $X_2$, etc... and $Y_1$, $Y_2$, ....) and you calculate the sample correlation coefficient, you will get a number between $-1$ and $1$. But it won't be an approximation of the population correlation coefficient (which doesn't exist).

Instead, $\hat{\rho}_{XY}(T)$ (calculated using time-series averages from $t=1$ to $t=T$) is going to basically be a random variable (taking values in $[-1, 1]$) which reflects the two particular paths the random walks took by chance (i.e. the paths defined by the draw $\omega$ drawn from sample space $\Omega$.) Speaking extremely loosely (and imprecisely):

  • If both $X_t$ and $Y_t$ happened to wander off in the same direction, you'll detect a spurious positive relationship.
  • If $X_t$ and $Y_t$ wandered off in different directions, you'll detect a spurious negative relationship.
  • If $X_t$ and $Y_t$ happened to wander across each other enough, you'll detect a near zero relationship.

You can Google more about this with the terms *spurious regression random walk*.

A random walk isn't stationary, and taking averages over time $t$ won't converge on what you would get by taking iid draws $\omega$ from the sample space $\Omega$. As mentioned in the comments above, you can take first differences $\Delta x_t = x_t - x_{t-1}$, and for a random walk that differenced process $\{\Delta x_t\}$ is stationary.
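To see the effect of differencing concretely, here is a minimal Python sketch (seed and length arbitrary): the sample correlation between the first differences of two independent walks is near zero and shrinks as $T$ grows, while the correlation between the levels stays erratic:

import numpy as np

rng = np.random.default_rng(1)         # arbitrary seed
T = 100_000
x = np.cumsum(rng.standard_normal(T))  # two independent random walks
y = np.cumsum(rng.standard_normal(T))

dx, dy = np.diff(x), np.diff(y)        # first differences: iid N(0, 1) again

print(np.corrcoef(x, y)[0, 1])         # levels: an arbitrary, possibly large value
print(np.corrcoef(dx, dy)[0, 1])       # differences: close to 0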

Big picture idea:

Multiple observations over time are NOT the same as multiple draws from a sample space!

Recall that a discrete time stochastic process $\{ X_t \}$ is a function of both time ($t \in \mathbb{N}$) and a sample space $\Omega$.

For averages over time $t$ to converge towards expectations over a sample space $\Omega$, you need stationarity and ergodicity. This is a core issue in much of time-series analysis. And a random walk isn't a stationary process.

Connection to whuber's answer:

If you can take averages across multiple simulations (i.e. take multiple draws from $\Omega$) instead of being forced to take averages across time $t$, a number of your issues disappear.

You can of course define $\hat{\rho}_{XY}(t)$ as the sample correlation coefficient computed on $X_1\ldots X_t$ and $Y_1 \ldots Y_t$ and this will also be a stochastic process.

You can define some random variable $Z_t$ as:

$$Z_t = |\hat{\rho}_{XY}(t)|$$

For two random walks starting at $0$ with $\mathcal{N}(0,1)$ increments, it's easy to find $E[Z_{10000}]$ by simulation (i.e. taking multiple draws from $\Omega$.)

Below, I ran a simulation of 10,000 calculations of a sample Pearson correlation coefficient. Each time I:

  • Simulated two 10,000-length random walks (with normally distributed increments drawn from $\mathcal{N}(0,1)$).
  • Calculated the sample correlation coefficient between them.

Below is a histogram showing the empirical distribution of the 10,000 calculated correlation coefficients.

[Figure: histogram of the 10,000 sample correlation coefficients]

You can clearly observe that the random variable $\hat{\rho}_{XY}(10000)$ can be all over the place in the interval $[-1, 1]$. For two fixed paths of $X$ and $Y$, the sample correlation coefficient doesn't converge to anything as the length of the time series increases.
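A minimal sketch of this non-convergence (one pair of fixed paths; seed and lengths arbitrary), tracking the running sample correlation $\hat{\rho}_{XY}(t)$ as $t$ grows along the *same* pair of paths:

import numpy as np

rng = np.random.default_rng(2)         # arbitrary seed
T = 10**6
x = np.cumsum(rng.standard_normal(T))  # one fixed path of X
y = np.cumsum(rng.standard_normal(T))  # one fixed path of Y

# Sample correlation of the first t observations: it wanders instead of settling.
for t in (10**3, 10**4, 10**5, 10**6):
    print(f"t={t:>7}  rho_hat = {np.corrcoef(x[:t], y[:t])[0, 1]:+.3f}")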

On the other hand, for a particular time (eg. $t=10,000$), the sample correlation coefficient is a random variable with a finite mean etc... If I take the absolute value and compute the mean over all the simulations, I calculate approximately .42. I'm not sure why you'd want to do this or why it's meaningful, but of course you can.

Code:

n_sims = 10000;
z = zeros(n_sims, 1);            % preallocate results
for i = 1:n_sims
  X = randn(10000, 2);           % two columns of iid N(0,1) increments
  Y = cumsum(X);                 % two independent random walks, length 10,000
  z(i) = corr(Y(:,1), Y(:,2));   % sample correlation between the two walks
end
histogram(z, 20);
mean(abs(z))                     % approximately 0.42
Matthew Gunn
  • Since the sample size obviously is not finite, your assertions about various quantities not existing are puzzling. It's difficult to see how your symbols apply to the situation described by the OP. – whuber Jan 10 '17 at 22:46
  • Your sample size **NEVER EVER** goes to infinity! Not as long as you are drawing samples with a computer, (*only in pure math you may make such assumptions*). And what does that mean: Because you have infinitely many points it does not converge? Where did you read that? – Mayou36 Jan 10 '17 at 22:59
  • @whuber Hopefully this version is a bit clearer. I take it the OP is asking why the sample correlation coefficient (based upon time-series averages) between two finite segments of random walks isn't zero, even for time-series of immense length. A fundamental problem is that for a random walk, various population moments don't exist and time-series averages don't converge to anything. – Matthew Gunn Jan 11 '17 at 00:04
  • Nevertheless, for fixed $n$ everything is finite. Moreover, the expectation of the absolute sample correlation coefficient *does* converge as $n$ increases! Note, too, that the question concerns the *absolute value* of that coefficient. Its expectation (obviously) is zero. – whuber Jan 11 '17 at 00:05
  • 1
  • @whuber Do you mean for fixed time-series length $t$, everything is finite? (Yes, I agree with that.) The expectation of the sample correlation is zero (yes, I agree with that). As $t$ increases, though, the sample correlation does *not* converge on a single point. For two random-walk segments of arbitrary length, the sample correlation coefficient isn't that far from a random draw from the uniform distribution on [0, 1] (see histogram). – Matthew Gunn Jan 11 '17 at 00:12
  • What might be confusing is the OP's assertion that the absolute correlation coefficient is constant: it's not; it cannot possibly be. But it does rapidly converge to a (non-uniform) distribution and its expectation is close to what the OP observes. – whuber Jan 11 '17 at 00:37
  • So is it .42 or .56? You have .56 in your update, but the OP now says it's .42 and you also wrote .42 in a comment to whuber's answer. – amoeba Jan 12 '17 at 13:09
  • @amoeba Thanks for the careful reading! My Monte-Carlo calculation was .42. – Matthew Gunn Jan 12 '17 at 17:29
  • @MatthewGunn +1 for mentioning stationarity, etc. and showing that averaging the results of multiple calculations is much different from a single case where there is absolutely no convergence no matter how large $n$ we choose :) – Adam Jan 14 '17 at 19:30

The math needed to obtain an exact result is messy, but we can derive an exact value for the expected squared correlation coefficient relatively painlessly. It helps explain why a value near $1/2$ keeps showing up and why increasing the length $n$ of the random walk won't change things.

There is potential for confusion about standard terms. The absolute correlation referred to in the question, along with the statistics that make it up--variances and covariances--are formulas that one can apply to any pair of realizations of random walks. The question concerns what happens when we look at many independent realizations. For that, we need to take expectations over the random-walk process.


(Edit)

Before we proceed, I want to share some graphical insights with you. A pair of independent random walks $(X,Y)$ is a random walk in two dimensions. We can plot the path that steps from each $(X_t,Y_t)$ to $(X_{t+1},Y_{t+1})$. If this path tends downwards (from left to right, plotted on the usual X-Y axes), then, in order to study the absolute value of the correlation, let's negate all the $Y$ values. Plot the walks on axes sized to give the $X$ and $Y$ values equal standard deviations and superimpose the least-squares fit of $Y$ to $X$. The slopes of these lines will be the absolute values of the correlation coefficients, lying always between $0$ and $1$.

This figure shows $15$ such walks, each of length $960$ (with standard Normal differences). Little open circles mark their starting points. Dark circles mark their final locations.

[Figure: 15 superimposed two-dimensional random walks with their least-squares lines]

These slopes tend to be pretty large. Perfectly random scatterplots of this many points would always have slopes very close to zero. If we had to describe the patterns emerging here, we might say that most 2D random walks gradually migrate from one location to another. (These aren't necessarily their starting and endpoint locations, however!) About half the time, then, that migration occurs in a diagonal direction--and the slope is accordingly high.
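Here is a small numerical check of the geometric fact the figure relies on (a Python sketch; walk length $960$ as in the figure, seed arbitrary): once both coordinates are scaled to unit standard deviation, and $Y$ is negated when the fit slopes downward, the least-squares slope equals the absolute correlation:

import numpy as np

rng = np.random.default_rng(3)           # arbitrary seed
x = np.cumsum(rng.standard_normal(960))  # one pair of independent walks,
y = np.cumsum(rng.standard_normal(960))  # length 960 as in the figure

r = np.corrcoef(x, y)[0, 1]

# Scale each coordinate to unit standard deviation; flip y if the fit slopes down.
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std() * np.sign(r)

slope = np.polyfit(xs, ys, 1)[0]  # least-squares slope of ys on xs
print(abs(r), slope)              # the two values agree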

The rest of this post sketches an analysis of this situation.


A random walk $(X_i)$ is a sequence of partial sums of $(W_1, W_2, \ldots, W_n)$ where the $W_i$ are independent identically distributed zero-mean variables. Let their common variance be $\sigma^2$.

In a realization $x = (x_1, \ldots, x_n)$ of such a walk, the "variance" would be computed as if this were any dataset:

$$\operatorname{V}(x) = \frac{1}{n-1}\sum (x_i-\bar x)^2.$$

A nice way to compute this value is to take half the average of all the squared differences:

$$\operatorname{V}(x) = \frac{1}{n(n-1)}\sum_{j \gt i} (x_j-x_i)^2.$$
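As a quick numerical check of this identity (a Python sketch; any dataset works, the walk and seed are arbitrary choices):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)          # arbitrary seed
x = np.cumsum(rng.standard_normal(50))  # any dataset will do; a short walk here
n = len(x)

v1 = np.sum((x - x.mean())**2) / (n - 1)                             # usual formula
v2 = sum((b - a)**2 for a, b in combinations(x, 2)) / (n * (n - 1))  # pairwise form
print(v1, v2)  # equal up to floating-point error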

When $x$ is viewed as the outcome of a random walk $X$ of $n$ steps, the expectation of this is

$$\mathbb{E}(\operatorname{V}(X)) = \frac{1}{n(n-1)}\sum_{j \gt i} \mathbb{E}\left((X_j-X_i)^2\right).$$

The differences are sums of iid variables,

$$X_j - X_i = W_{i+1} + W_{i+2} + \cdots + W_j.$$

Expand the square and take expectations. Because the $W_k$ are independent and have zero means, the expectations of all cross terms are zero. That leaves only terms like $W_k^2$, whose expectation is $\sigma^2$. Thus

$$\mathbb{E}\left((X_j - X_i)^2\right) =\mathbb{E}\left((W_{i+1} + W_{i+2} + \cdots + W_j)^2\right)= (j-i)\sigma^2.$$

It easily follows that

$$\mathbb{E}(\operatorname{V}(X)) = \frac{1}{n(n-1)}\sum_{j \gt i} (j-i)\sigma^2 = \frac{n+1}{6}\sigma^2.$$
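This expectation is easy to confirm by Monte Carlo (a Python sketch; $n$, $\sigma = 1$, and the number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
n, reps = 100, 20_000           # walk length and number of replications
walks = np.cumsum(rng.standard_normal((reps, n)), axis=1)  # reps walks of n steps

v = walks.var(axis=1, ddof=1)   # V(x) with the 1/(n-1) normalization used above
print(v.mean(), (n + 1) / 6)    # both approximately (n + 1) sigma^2 / 6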

The covariance between two independent realizations $x$ and $y$--again in the sense of datasets, not random variables--can be computed with the same technique (but it requires more algebraic work; a quadruple sum is involved). The result is that the expected square of the covariance is

$$\mathbb{E}(\operatorname{C}(X,Y)^2) = \frac{3n^6-2n^5-3n^2+2n}{480n^2(n-1)^2}\sigma^4.$$

Consequently the expectation of the squared correlation coefficient between $X$ and $Y$, taken out to $n$ steps, is approximately the ratio of these two expectations (strictly, the expectation of a ratio is not the ratio of expectations, but the two agree asymptotically):

$$\rho^2(n) = \frac{\mathbb{E}(\operatorname{C}(X,Y)^2)}{\mathbb{E}(\operatorname{V}(X))^2} = \frac{3}{40}\frac{3n^3-2n^2+3n-2}{n^3-n} = \frac{9}{40}\left(1+O\left(\frac{1}{n}\right)\right).$$

Although this is not constant, it rapidly approaches a limiting value of $9/40$. Its square root, approximately $0.47$, therefore approximates the expected absolute value of $\rho(n)$ (and, by Jensen's inequality, overestimates it).


I am sure I have made computational errors, but simulations bear out the asymptotic accuracy. In the following results, showing the histograms of $\rho^2(n)$ for $10{,}000$ simulations each, the vertical red lines show the means while the dashed blue lines show the formula's value. Clearly the formula is off for small $n$, but asymptotically it is right. Evidently the entire distribution of $\rho^2(n)$ is approaching a limit as $n$ increases. Similarly, the distribution of $|\rho(n)|$ (which is the quantity of interest) will approach a limit.

[Figure: histograms of $\rho^2(n)$ for $n = 3, 10, 30, 100$, with simulation means (red) and the formula's value (dashed blue)]

This is the R code to produce the figure.

# Expected squared correlation from the formula derived above.
f <- function(n) (2 - 3*n + 2*n^2 - 3*n^3) / (n - n^3) * 3/40

n.sim <- 1e4
par(mfrow=c(1,4))
for (n in c(3, 10, 30, 100)) {
  # Each column of x and y is an independent random walk of length n.
  x <- apply(matrix(rnorm(n*n.sim), nrow=n), 2, cumsum)
  y <- apply(matrix(rnorm(n*n.sim), nrow=n), 2, cumsum)
  # Squared sample correlation for each simulated pair of walks.
  sim <- sapply(seq_len(n.sim), function(i) cor(x[,i], y[,i])^2)
  # Standardized gap between the simulation mean and the formula's value.
  z <- signif(sqrt(n.sim)*(mean(sim) - f(n)) / sd(sim), 3)
  hist(sim, xlab="rho(n)^2", main=paste("n =", n), sub=paste("Z =", z))
  abline(v=mean(sim), lwd=2, col="Red")     # simulation mean
  abline(v=f(n), col="Blue", lwd=2, lty=3)  # formula's value
}
whuber
  • My Monte-Carlo simulation-based estimate of $E[\rho^2]$ for $T = 100$ is about .24 (which appears to agree with your results). I agree with your analysis here. You might be getting at how the OP came to his number (though I calculate about .42, not .56). – Matthew Gunn Jan 11 '17 at 00:45
  • If you can take repeated draws from $\Omega$, there's nothing particularly special about time-series analysis. Issues (eg. ergodicity, stationarity etc...) develop when you can only observe new values of $X$ by advancing time $t$ which I assumed was what the OP was trying to get at... (but maybe not). – Matthew Gunn Jan 11 '17 at 00:46
  • 1
  • +1 but what is the intuition about why there is this positive asymptotic value $9/40$, whereas naively one would expect that if one takes two very long random walks they should have near-zero correlation, i.e. naively one would expect the distribution of correlations to shrink to zero as $n$ grows? – amoeba Jan 11 '17 at 22:34
  • @amoeba First, I don't fully believe the value of $9/40$, but I know it's close to correct. For the intuition, consider that two independent walks $X_t$ and $Y_t$ are a random walk $(X_t,Y_t)$ in two dimensions. Take *any* random scatterplot in 2D and measure its eccentricity somehow. It will be rare for it to be perfectly circular. Thus, we expect the mean eccentricity to be positive. That there is a limiting distribution for random walks merely reflects the self-similar "fractal" nature of this 2D walk. – whuber Jan 11 '17 at 22:40
  • The intuition that I have is as follows: any given random walk will not fluctuate around zero, it will tend to grow as $\sqrt{n}$, this is a very well-known result. So if we have two random walks, both will typically "grow" ("fan out") as a square-root and hence might very well have large absolute correlation coefficient, no matter what $n$ is. Do you think this intuition makes sense? Regarding 9/40: do you suspect a mistake in your formula for covariance? If not, what can be a source of the error? – amoeba Jan 11 '17 at 22:43
  • @amoeba I found the immediate source of the error--a typographical error in a calculation! However, even after fixing it I was unable to obtain results close to the simulation, suggesting there is a more basic error somewhere. Although it's a routine calculation, it's messy and being off by just $1$ in an index somewhere will ruin it. I haven't the time to fix it, unfortunately. I wrote this answer hastily, hoping only that it might cast some light on the meaning of the original question and suggest some ways of analyzing the situation. – whuber Jan 11 '17 at 22:46
  • @whuber +1 for trying to calculate the formula, the scatterplots and showing the relationship between least-squares and correlation coefficient :) that helped me a lot! – Adam Jan 14 '17 at 19:39
  • 2
  • An asymptotic analysis of the issues discussed here may be found in [Phillips (1986), Theorem 1e](https://ideas.repec.org/p/cwl/cwldpp/757.html). – Christoph Hanck Jan 27 '17 at 16:30