When two variables are non-stationary, it is very likely that they will appear strongly correlated with each other, even when they are unrelated. For example, $x_t$ and $y_t$ below are generated independently as random walks, using different seed values:
$$x_t = x_{t-1} + \epsilon_t$$
$$y_t = y_{t-1} + u_t$$
In the code, $\epsilon_t$ and $u_t$ are drawn with non-zero means, so both walks also have a drift.
set.seed(123)
x <- cumsum(rnorm(100,10,10))
set.seed(321)
y <- cumsum(rnorm(100,5,5))
cor(x,y)^2
[1] 0.989947
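To make the link between the recursion and `cumsum` explicit, here is a minimal sketch that builds $x_t$ step by step and compares it with the `cumsum` version. The names `eps`, `x_loop`, and `x_cumsum` are introduced here for illustration only, and a starting value of $x_0 = 0$ is an assumption.

```r
# Build the random walk by explicit recursion and compare it with cumsum().
set.seed(123)
eps <- rnorm(100, 10, 10)        # same innovations as x above (mean 10, sd 10)

x_loop <- numeric(length(eps))
x_loop[1] <- eps[1]              # assumes a starting value x_0 = 0
for (i in 2:length(eps)) {
  x_loop[i] <- x_loop[i - 1] + eps[i]   # x_t = x_{t-1} + eps_t
}

x_cumsum <- cumsum(eps)
all.equal(x_loop, x_cumsum)      # checks that both constructions agree
```

Both constructions accumulate the same draws in the same order, so `cumsum` is just a compact way of writing the recursion.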
I can repeat this 10,000 times with different seed values, means, and standard deviations. The $R^2$ is extremely high almost all the time: it exceeds 0.93 in three quarters of the runs. What is driving this spurious correlation?
spuriousRegression <- function(seed1 = 123, seed2 = 321, mean1 = 10, mean2 = 5,
                               sd1 = 10, sd2 = 5, nsize = 100) {
  set.seed(seed1)
  x <- cumsum(rnorm(nsize, mean1, sd1))  # non-stationary
  set.seed(seed2)
  y <- cumsum(rnorm(nsize, mean2, sd2))  # non-stationary
  lm.mod <- lm(y ~ x)                    # regress one walk on the other
  summary.mod <- summary(lm.mod)
  return(summary.mod$r.squared)          # R-squared of the regression
}
spuriousRegression()
repeatKTimes <- mapply(spuriousRegression, seed1 = 1:10000, seed2 = 10000:1,
                       mean1 = 0:9999, mean2 = 9999:0, sd1 = 1:10000, sd2 = 1:10000)
summary(repeatKTimes)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000007 0.934600 0.983500 0.893100 0.992200 0.999100
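One check I could run, as a minimal sketch rather than an answer, is to compare the $R^2$ of the levels with the $R^2$ of the first differences for the original pair of series. If the high value comes from the shared trending behaviour of the two walks rather than from any relationship between the shocks, the differenced value should be much smaller. The names `dx`, `dy`, `r2_levels`, and `r2_diff` are introduced here for illustration only.

```r
# Compare R-squared in levels with R-squared in first differences.
set.seed(123)
x <- cumsum(rnorm(100, 10, 10))  # same x as in the first example
set.seed(321)
y <- cumsum(rnorm(100, 5, 5))    # same y as in the first example

r2_levels <- cor(x, y)^2         # the spurious R-squared reported above

dx <- diff(x)                    # first differences recover the innovations
dy <- diff(y)
r2_diff <- cor(dx, dy)^2         # R-squared between the differenced series

c(levels = r2_levels, differences = r2_diff)
```

Differencing is only one way to probe this. It removes the accumulated sum, so it does not by itself separate the role of the drift from the role of the stochastic trend.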