
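When two variables are non-stationary, they will very often appear strongly correlated with each other, even when they are unrelated. For example, $x_t$ and $y_t$ below were generated independently as random walks, using different seed values: $$x_t = x_{t-1} + \epsilon_t$$ $$y_t = y_{t-1} + u_t$$ Here $\epsilon_t$ and $u_t$ are independent normal shocks (with nonzero means in the code below, so both walks also have drift).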

set.seed(123)
x <- cumsum(rnorm(100,10,10))

set.seed(321)
y <- cumsum(rnorm(100,5,5))

cor(x,y)^2

[1] 0.989947

I can repeat this 10,000 times with different seed values, means, and standard deviations, and the $R^2$ is extremely high almost every time. What is driving this spurious correlation?

spuriousRegression <- function(seed1=123, seed2=321, mean1=10, mean2=5, sd1=10, sd2=5, nsize=100){

  set.seed(seed1)
  x <- cumsum(rnorm(nsize,mean1,sd1))  ### non-stationary

  set.seed(seed2)
  y <- cumsum(rnorm(nsize,mean2,sd2))  ### non-stationary

  lm.mod <- lm(y ~ x)
  summary.mod <- summary(lm.mod)

  return(summary.mod$r.squared)

}

spuriousRegression()

repeatKTimes <- mapply(spuriousRegression, seed1=1:10000, seed2=10000:1,
                       mean1=0:9999, mean2=9999:0, sd1=1:10000, sd2=1:10000)

summary(repeatKTimes) 

     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000007 0.934600 0.983500 0.893100 0.992200 0.999100 
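One way to check whether non-stationarity itself is what matters is a control run that uses the same kind of draws but skips `cumsum()`, so that both series are stationary. The sketch below is only an illustration: `stationaryRegression` is a hypothetical helper (not part of the code above), and the seed ranges are arbitrary, chosen so that `seed1` and `seed2` never coincide.

```r
# Hypothetical control: same setup as spuriousRegression(), but without cumsum(),
# so x and y are i.i.d. normal draws (stationary) rather than random walks.
stationaryRegression <- function(seed1 = 123, seed2 = 321, mean1 = 10, mean2 = 5,
                                 sd1 = 10, sd2 = 5, nsize = 100) {
  set.seed(seed1)
  x <- rnorm(nsize, mean1, sd1)  # stationary

  set.seed(seed2)
  y <- rnorm(nsize, mean2, sd2)  # stationary

  summary(lm(y ~ x))$r.squared
}

# Arbitrary, non-overlapping seed ranges (an assumption for this sketch)
controlR2 <- mapply(stationaryRegression, seed1 = 1:1000, seed2 = 1001:2000)

summary(controlR2)
```

If the high $R^2$ values really come from the cumulative sums, the summary here should sit close to zero rather than close to one.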
William Chiu
  • Two key concepts to understand are: (1) for non-stationary processes, time-invariant moments (e.g. $\operatorname{E}[X]$) do not exist; (2) for averages over *time* to converge to averages over *space*, you need [stationarity and ergodicity](https://en.wikipedia.org/wiki/Ergodic_theory#Ergodic_theorems). For any point in time $t$, the correlation between the random variables $X_t$ and $Y_t$ is indeed zero, **but you cannot estimate that number by taking time-series averages**. – Matthew Gunn May 20 '17 at 05:21
  • (1) For any time $t$, $\operatorname{Corr}(X_t, Y_t) = 0$. But (2) a time-invariant correlation between the processes, $\operatorname{Corr}(X, Y)$, does not exist. And (3) a time-series average (as is used when estimating `lm(y ~ x)`) will not converge to averages over space because the assumptions of the ergodic theorem are not satisfied: the process is not stationary. – Matthew Gunn May 20 '17 at 05:29
  • Would you happen to have a paper that shows (analytically) that $R^2$ is severely inflated in the presence of random walks? – William Chiu May 20 '17 at 08:54
  • I also wanted to get your thoughts on this question: https://stats.stackexchange.com/questions/280674/can-i-trust-the-r2-from-an-ols-regression-of-two-cointegrated-series – William Chiu May 20 '17 at 08:55
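
The distinction drawn in the comments between averages over *space* and averages over *time* can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not the commenter's own code: it uses driftless random walks with standard normal shocks (unlike the drifting walks in the question), and the seed, `n_paths`, and `n_time` are arbitrary choices.

```r
set.seed(42)       # arbitrary seed (assumption)
n_paths <- 5000    # number of independent (x, y) pairs
n_time  <- 100     # length of each random walk

# Each column is one driftless random walk: the cumulative sum of N(0, 1) shocks.
X <- apply(matrix(rnorm(n_time * n_paths), nrow = n_time), 2, cumsum)
Y <- apply(matrix(rnorm(n_time * n_paths), nrow = n_time), 2, cumsum)

# Average over "space": correlation between X_t and Y_t across paths at a fixed t.
cross_section_cor <- cor(X[n_time, ], Y[n_time, ])

# Average over time: squared correlation within each single pair of paths,
# which is what lm(y ~ x) reports as R^2.
time_series_r2 <- sapply(seq_len(n_paths), function(j) cor(X[, j], Y[, j])^2)

cross_section_cor
summary(time_series_r2)
```

Across paths at a fixed $t$, the sample correlation should be close to zero, consistent with $\operatorname{Corr}(X_t, Y_t) = 0$. The within-path $R^2$ values, by contrast, are typically far larger than one would expect from independent stationary series of the same length.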
