
I have two time series (both smooth) that I would like to cross-correlate to see how correlated they are.

I intend to use the Pearson correlation coefficient. Is this appropriate?

My second question: I can choose to sample the two time series however I like, i.e. I can choose how many data points to use. Will this affect the correlation coefficient that is output? Do I need to account for this?

For illustration purposes:

Option (i):

[1, 4, 7, 10] & [6, 9, 6, 9, 6]

Option (ii):

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] & [6, 7, 8, 9, 8, 7, 6, 7, 8, 9, 8, 7, 6]

3 Answers


Pearson correlation is used to look at correlation between series ... but with time series the correlation is typically looked at across different lags -- the cross-correlation function.

The cross-correlation is impacted by dependence within each series, so in many cases the within-series dependence should be removed first. So to use this correlation, rather than smoothing the series, it's actually more common (because it's meaningful) to look at dependence between residuals -- the rough part that's left over after a suitable model is found for the variables.
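
To make that concrete, here's a minimal sketch in R (my own illustration, not from the original answer; the AR(1) choice and the per-series models are assumptions, and prewhitening is sometimes instead done by filtering both series with a model fitted to one of them):

    set.seed(1)
    n <- 200

    # Two independent but strongly autocorrelated (AR(1)) series
    x <- arima.sim(model = list(ar = 0.9), n = n)
    y <- arima.sim(model = list(ar = 0.9), n = n)

    # Naive cross-correlation of the raw series: often sizeable at some lag
    ccf(x, y, lag.max = 10, plot = FALSE)

    # Remove the within-series dependence with a simple model per series,
    # then look at the cross-correlation of the residuals (the "rough part")
    rx <- residuals(arima(x, order = c(1, 0, 0)))
    ry <- residuals(arima(y, order = c(1, 0, 0)))
    ccf(rx, ry, lag.max = 10, plot = FALSE)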

You probably want to begin with some basic resources on time series models before delving into trying to figure out whether a Pearson correlation across (presumably) nonstationary, smoothed series is interpretable.

In particular, you'll probably want to look into the phenomenon here. [In time series this is sometimes called spurious correlation, though the Wikipedia article on spurious correlation takes a narrow view of the term in a way that would seem to exclude this use. You'll probably find more on the issues discussed here by searching for spurious regression instead.]

[Edit -- the Wikipedia landscape keeps changing; the above para. should probably be revised to reflect what's there now.]

e.g. see some discussions:

  1. http://www.math.ku.dk/~sjo/papers/LisbonPaper.pdf (the opening quote of Yule, in a paper presented in 1925 but published the following year, summarizes the problem quite well)

  2. Christos Agiakloglou and Apostolos Tsimpanos, Spurious Correlations for Stationary AR(1) Processes http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.611.5055&rep=rep1&type=pdf (this shows that you can even get the problem between stationary series; hence the tendency to prewhiten)

  3. The classic reference of Yule (1926) [1] mentioned above.

You may also find the discussion here useful, as well as the discussion here.

--

Using Pearson correlation in a meaningful way between time series is difficult and sometimes surprisingly subtle.


> I looked up spurious correlation, but I don't care if my A series is the cause of my B series or vice versa. I only want to know if you can learn something about series A by looking at what series B is doing (or vice versa). In other words -- do they have a correlation.

Take note of my previous comment about the narrow use of the term spurious correlation in the Wikipedia article.

The point about spurious correlation is that series can appear correlated, but the correlation itself is not meaningful. Consider two people tossing two distinct coins, each counting the number of heads so far minus the number of tails so far as the value of their series.

(So if person 1 tosses $\text{HTHH...}$ they have 3-1 = 2 for the value at the 4th time step, and their series goes $1, 0, 1, 2,...$.)

Obviously there's no connection whatever between the two series. Clearly neither can tell you the first thing about the other!

But look at the sort of correlations you get between pairs of coins:

[Figure: the cumulated coin-toss series plotted against time; the pairs appear strongly correlated]

If I didn't tell you what those were, and you took any pair of those series by themselves, those would be impressive correlations, would they not?

But they're all meaningless. Utterly spurious. None of the three pairs are really any more positively or negatively related to each other than any of the others -- it's just cumulated noise. The spuriousness isn't just about prediction; the whole notion of considering association between series without taking account of the within-series dependence is misplaced.

All you have here is within-series dependence. There's no actual cross-series relation whatever.

Once you deal properly with the issue that makes these series auto-dependent - they're all integrated (Bernoulli random walks), so you need to difference them - the "apparent" association disappears (the largest absolute cross-series correlation of the three is 0.048).
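
A minimal sketch of this experiment in R (the seed and series length are my own arbitrary choices):

    set.seed(123)
    n <- 1000

    # Three independent Bernoulli random walks:
    # cumulative sums of independent +/-1 coin tosses
    coins <- replicate(3, cumsum(sample(c(-1, 1), n, replace = TRUE)))
    colnames(coins) <- c("coin1", "coin2", "coin3")

    cor(coins)        # large, utterly spurious cross-series "correlations"
    cor(diff(coins))  # difference first: the apparent association vanishes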

What that tells you is the truth -- the apparent association is a mere illusion caused by the dependence within-series.

Your question asked "how to use Pearson correlation correctly with time series" -- so please understand: if there's within-series dependence and you don't deal with it first, you won't be using it correctly.

Further, smoothing won't reduce the problem of serial dependence; quite the opposite -- it makes it even worse! Here are the correlations after smoothing (default loess smooth of series vs index, performed in R):

               coin1      coin2
    coin2  0.9696378
    coin3 -0.8829326 -0.7733559

They all got further from 0. They're all still nothing but meaningless noise, though now it's smoothed, cumulated noise. (By smoothing, we reduce the variability in the series we put into the correlation calculation, so that may be why the correlation goes up.)
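
Continuing the sketch above (the exact call is an assumption; the answer only says a default loess smooth of each series against its index was used):

    t <- seq_len(n)
    smoothed <- apply(coins, 2, function(s) fitted(loess(s ~ t)))
    cor(smoothed)  # the spurious correlations move even further from 0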

[1]: Yule, G.U. (1926), "Why do we Sometimes get Nonsense-Correlations between Time-Series?", J. Roy. Stat. Soc., 89, 1, pp. 1-63.

Glen_b
  • Thank you for the great answer. I looked up spurious correlation, but I don't care if my A series is the cause of my B series or vice versa. I only want to know if you can learn something about series A by looking at what series B is doing (or vice versa). In other words -- do they have a correlation. – user1551817 Jan 14 '15 at 00:09
  • Please see my updated answer. – Glen_b Jan 14 '15 at 01:55
  • Let me just add that relations between multivariate time series can be studied using [cointegration](https://en.wikipedia.org/wiki/Cointegration). In this framework the random walks above are *independent* integrated (or cumulative) noise, but the framework allows for dependent integration, or cointegration, as well. Cointegration is more appropriate than correlation for studying dependencies between non-stationary time series, e.g. time series that contain random walk components. – NRH Aug 25 '15 at 07:26
  • @NRH Thanks; cointegration is important to mention; I should include a section on that, because with cointegrated series things run differently to what I've stated above. However, there are more forms of nonstationarity than integrated and cointegrated series; those are certainly useful, especially in particular contexts, but they don't cover everything. (My Bernoulli random walks were chosen because they're simple to generate, not because they're intended to be particularly representative.) – Glen_b Aug 25 '15 at 09:37
  • "..so you need to difference them.." what does it mean exactly? Perhaps differentiating them?.. – George Pligoropoulos Jul 21 '17 at 21:11
  • In addition, this part "..more common (because it's meaningful) to look at dependence between residuals.." means that we take the residuals (model output minus real values) and then, having these two time series of residuals, take the correlation of them? – George Pligoropoulos Jul 21 '17 at 21:12
  • Differencing - see Wikipedia [here](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average#Differencing) or [this section](https://www.otexts.org/fpp/8/1) of the book *Forecasting, Principles and Practice*. On your subsequent question, the remainder of the paragraph you quote is quite explicitly saying so. (It's not the only possibility, though, just describing one reasonably common thing that's done) – Glen_b Jul 21 '17 at 22:09
  • @Glen_b Thank you for your answer. The rest of the paragraph is "the rough part that's left over after a suitable model is found for the variables". But aren't there models that could fit the training set perfectly and thus make the residuals zero or near zero? Some elaboration inside the answer would be helpful – George Pligoropoulos Jul 24 '17 at 03:45
  • That's encompassed by the word *suitable*. – Glen_b Jul 24 '17 at 06:56
  • @Glen_b Could you please provide the name of the paper on the second reference? I can't find it anymore, the link is dead, and I don't know which papers of the author this is referencing. – Yuri-M-Dias Oct 26 '18 at 16:36
  • I have located what seems to be another version of the paper, and added title and authors – Glen_b Oct 26 '18 at 18:04
  • @Glen_b: First of all, thank you for a great answer. Secondly, I believe you recommend differencing the series not just once but as many times as it takes until we get a stationary series. Is that so? If Series A is I(1) and Series B is I(2), would the correlation between the once differenced Series A and twice-differenced Series B be meaningful? The other approach to establishing correlation between time series - cointegration, is not meaningful between series that have different orders of integration. – ColorStatistics Jan 24 '19 at 00:08
  • Not necessarily just differencing\*\*, but a model for the dependence within the series; specifically, you seek to remove any autocorrelation structure. $\quad$ \*\* -- and I'd warn against differencing "many times" -- in general if you need to difference more than twice, with perhaps a seasonal one in there if needed, you should probably be looking at something other than just differencing (possibly transformation, or perhaps adjusting for other variables or whatever else is producing the structure you have) – Glen_b Sep 29 '20 at 00:31

To complete Glen_b's answer and his example on random walks: if you really want to use Pearson correlation on this kind of time series $(S_t)_{1 \leq t \leq T}$, you should first difference them, then work out the correlation coefficient on the increments ($X_t = S_t - S_{t-1}$), which are (in the case of random walks) independent and identically distributed. I suggest you use the Spearman correlation or Kendall's, as they are more robust than the Pearson coefficient: Pearson measures linear dependence, whereas the Spearman and Kendall measures are invariant under monotonic transformations of your variables.

Also, imagine that two time series are strongly dependent, say they move up together and down together, but one sometimes undergoes strong variations while the other always has mild variations; your Pearson correlation will be rather low, unlike the Spearman and Kendall ones (which are better estimates of the dependence between your time series).
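
A small sketch of both points (the quintic transform below is my own stand-in for "moves in the same direction, but sometimes far more strongly"):

    set.seed(7)
    n <- 500
    z <- rnorm(n)     # shared driver of the increments
    x <- cumsum(z)    # series with mild variations
    y <- cumsum(z^5)  # same direction, occasionally much stronger moves

    # Difference first, then correlate the increments
    dx <- diff(x)
    dy <- diff(y)

    cor(dx, dy, method = "pearson")   # typically well below 1
    cor(dx, dy, method = "spearman")  # essentially 1: z^5 is monotone in z
    cor(dx, dy, method = "kendall")   # likewise essentially 1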

For a thorough treatment of this and a better understanding of dependence, you can look at copula theory and its applications to time series.

Kuma

Time series data is usually dependent on time. Pearson correlation, however, is appropriate for independent data. This problem is similar to so-called spurious regression: the coefficient is likely to be highly significant, but this comes only from the time trend of the data that affects both series. I recommend modelling the data and then seeing whether the modelling produces similar results for both series. Using the Pearson correlation coefficient, however, will most likely give misleading results for the interpretation of the dependence structure.
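
As a rough sketch of that recommendation (the linear time trend is my own choice of model here; the answer leaves the model unspecified):

    set.seed(11)
    n <- 200
    t <- seq_len(n)

    # Two unrelated series that share nothing but a time trend
    x <- 0.05 * t + rnorm(n)
    y <- 0.03 * t + rnorm(n)

    cor(x, y)  # inflated purely by the common trend

    # Model the trend, then correlate what the models leave over
    rx <- residuals(lm(x ~ t))
    ry <- residuals(lm(y ~ t))
    cor(rx, ry)  # near zero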

random_guy
  • Can you elaborate on "Pearson correlation is appropriate for independent data"? As far as I know, for independent variables the Pearson correlation would just be zero (in the sense that you don't need to carry out a Pearson correlation). – stucash Mar 11 '20 at 16:25
  • I believe random_guy meant _within-series_ dependence. Given two series $X = x_1, ..., x_n$ and $Y = y_1, ..., y_n$, Pearson correlation assumes independence between $x_i, x_j$ (or $y_i, y_j$). However, this is usually not the case, because time-series are dependent on time and often have a trend. This trend can artificially inflate the correlation coefficient, especially if the trend itself correlates between $X$ and $Y$ (say, both go up). – ehudk Oct 22 '20 at 08:39