4

Suppose I have an online stream of data points $x_i,y_i$, where $i=1,2,\dots$. I want to compute the Pearson correlation coefficient between the vectors $\vec x$ and $\vec y$.

But here is the catch. I receive the points one by one, and computing the correlation from scratch with each new point would be too slow (at some point I cannot even store all the points at once).

So let $\rho_N$ be the Pearson correlation up to the $N$'th data point. Is there a way to efficiently update this to $\rho_{N+1}$ when I receive the next data point? (Probably I have to store some additional intermediate quantities as I receive more points).

Stephan Kolassa
becko
    A generalization of this question--efficient online multiple regression--is answered at https://stats.stackexchange.com/questions/6920. – whuber May 28 '19 at 14:51

3 Answers

7

Recall the formula for the sample Pearson correlation between two vectors $x\in\mathbb{R}^n$ and $y\in\mathbb{R}^n$ (Eq. 3 in Wikipedia):

$$ r = \frac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_{i=1}^n(x_i-\overline{x})^2}\sqrt{\sum_{i=1}^n(y_i-\overline{y})^2}} $$

We simply have to store and update the relevant quantities in this fraction:

  • $\overline{x}_{n+1}$ will contain the sample mean of $x_1, \dots, x_{n+1}$ (this is easily calculated online)
  • ditto for $\overline{y}_{n+1}$
  • $N_{n+1}=\sum_{i=1}^{n+1}(x_i-\overline{x})(y_i-\overline{y})$ will contain the numerator of $r$
  • $D_{n+1}=\sum_{i=1}^{n+1}(x_i-\overline{x})^2$ and $E_{n+1}=\sum_{i=1}^{n+1}(y_i-\overline{y})^2$ will contain the two components for the denominator.

Initialize:

$$ \overline{x}_0:=\overline{y}_0:=N_0:=D_0:=E_0:=0 $$

In updating, assume that $\overline{x}_n, \overline{y}_n, N_n, D_n, E_n$ are known, and that a new data pair $(x_{n+1}, y_{n+1})$ arrives. We update:

$$ \begin{aligned} \overline{x}_{n+1} &:= \frac{1}{n+1}\left(n\overline{x}_n+x_{n+1}\right) \\ \overline{y}_{n+1} &:= \frac{1}{n+1}\left(n\overline{y}_n+y_{n+1}\right) \\ N_{n+1} &:= N_n + (x_{n+1}-\overline{x}_{n+1})(y_{n+1}-\overline{y}_{n+1}) \\ D_{n+1} &:= D_n + (x_{n+1}-\overline{x}_{n+1})^2 \\ E_{n+1} &:= E_n + (y_{n+1}-\overline{y}_{n+1})^2. \end{aligned} $$

Then the correlation is

$$ r = \frac{N_{n+1}}{\sqrt{D_{n+1}}\sqrt{E_{n+1}}}. $$
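For concreteness, here is a minimal Python sketch of these updates (the class and variable names are my own, not part of the recurrences themselves):

```python
import math

class OnlineCorrelation:
    """Pearson correlation maintained incrementally via the updates above."""

    def __init__(self):
        self.n = 0          # number of points seen so far
        self.mean_x = 0.0   # running mean of the x_i
        self.mean_y = 0.0   # running mean of the y_i
        self.N = 0.0        # numerator accumulator
        self.D = 0.0        # accumulated squared deviations of x
        self.E = 0.0        # accumulated squared deviations of y

    def update(self, x, y):
        self.n += 1
        # mean update: (n * old_mean + new_point) / (n + 1)
        self.mean_x = ((self.n - 1) * self.mean_x + x) / self.n
        self.mean_y = ((self.n - 1) * self.mean_y + y) / self.n
        self.N += (x - self.mean_x) * (y - self.mean_y)
        self.D += (x - self.mean_x) ** 2
        self.E += (y - self.mean_y) ** 2

    def correlation(self):
        # undefined (0/0) until at least two distinct points have arrived
        return self.N / (math.sqrt(self.D) * math.sqrt(self.E))
```

Each incoming pair costs $O(1)$ time, and the stored state is just five numbers plus the count, no matter how many points have arrived.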

Multihunter
Stephan Kolassa
  • The LHS of the first line of the update should be $\bar{x}_{n+1}$, shouldn't it? – Glen_b Oct 01 '19 at 13:10
  • @Glen_b: you are right, it should. And it is. At least in the source code of my answer. I have no idea why MathJax decided to eat the overline. (I believe I already saw this when I first wrote the answer, but didn't see anything to do about it.) I'm looking forward to the next typo of mine you unearth! – Stephan Kolassa Oct 01 '19 at 13:33
  • See @Marcos's answer below. – becko Sep 13 '21 at 14:18
3

A few observations on Stephan Kolassa's answer:

Computing the incremental averages as follows is more numerically robust in practice: $$ \bar{x}_{n+1} = \bar{x}_{n} + \frac{x_{n+1} - \bar{x}_{n}}{n+1} $$ $$ \bar{y}_{n+1} = \bar{y}_{n} + \frac{y_{n+1} - \bar{y}_{n}}{n+1} $$

Note how this formulation avoids (potentially) large products such as $n\bar{x}_{n}$.


Then there's a small error in the first factors of $N_{n+1}$, $D_{n+1}$ and $E_{n+1}$: $$ N_{n+1} = N_{n} + (x_{n+1} - \bar{x}_{n})(y_{n+1} - \bar{y}_{n+1}) $$ $$ D_{n+1} = D_{n} + (x_{n+1} - \bar{x}_{n})(x_{n+1} - \bar{x}_{n+1}) $$ $$ E_{n+1} = E_{n} + (y_{n+1} - \bar{y}_{n})(y_{n+1} - \bar{y}_{n+1}) $$

Note the $\bar{x}_n$ (and $\bar{y}_{n}$) in the first factor, instead of $\bar{x}_{n+1}$ (and $\bar{y}_{n+1}$). When there's not much variance in the data series, $\bar{x}_n$ and $\bar{x}_{n+1}$ tend to be nearly identical, so it wouldn't matter much. However, for all other scenarios, the difference can be substantial.
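Putting the corrected mean and accumulator updates together, a minimal Welford-style sketch in Python could look like this (function and variable names are mine):

```python
import math

def online_pearson(stream):
    """Pearson correlation of an iterable of (x, y) pairs: one pass, O(1) state."""
    n = 0
    mean_x = mean_y = 0.0
    C = Sxx = Syy = 0.0            # co-moment and squared-deviation sums
    for x, y in stream:
        n += 1
        dx = x - mean_x            # deviation from the *old* mean of x
        dy = y - mean_y            # deviation from the *old* mean of y
        mean_x += dx / n           # incremental mean update from above
        mean_y += dy / n
        C   += dx * (y - mean_y)   # (x_{n+1} - old x-mean)(y_{n+1} - new y-mean)
        Sxx += dx * (x - mean_x)   # (x_{n+1} - old x-mean)(x_{n+1} - new x-mean)
        Syy += dy * (y - mean_y)
    return C / math.sqrt(Sxx * Syy)
```

In exact arithmetic this reproduces the batch $r$ exactly, and in floating point it is the numerically well-behaved formulation.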

Some useful references:
https://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online

0

It's worth noting that the first answer's updates, as originally written, do not calculate exactly the same thing as computing $r$ from scratch at the end, because the $\bar{x}$ value changes with each new point (the Welford-style correction in the other answer fixes this). On Wikipedia, there is a re-arrangement of the formula that allows "a convenient single-pass algorithm". I accidentally derived this formula myself before noticing it on Wikipedia.

$$ r_{xy} = \frac{\sum_i x_iy_i - n\bar{x}\bar{y}} {\sqrt{\sum_i x_i^2 - n\bar{x}^2}\sqrt{\sum_i y_i^2 - n\bar{y}^2}} $$

So, all you have to do is accumulate:

$$ \hat{x} := \sum^n_{i=1} x_i \;\;\;\;\;\;\; \hat{y} := \sum^n_{i=1} y_i $$ $$ \hat{a} := \sum^n_{i=1} x_i^2 \;\;\;\;\;\;\; \hat{b} := \sum^n_{i=1} y_i^2 \;\;\;\;\;\;\; \hat{c} := \sum^n_{i=1} x_iy_i $$

And plug those into the formula whenever you want $r$. For completeness here it is with my terminology for accumulated values:

$$ r = \frac{\hat{c} - \frac{\hat{x}\hat{y}}{n}}{\sqrt{\hat{a} - \frac{\hat{x}^2}{n}}\sqrt{\hat{b} - \frac{\hat{y}^2}{n}}} $$

Note: The $\hat{x}^2$ terms can get very large, so to use this method you may need to guard against overflow; the subtractions in the formula can also lose precision to catastrophic cancellation when the spread of the data is small relative to its mean.
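A minimal sketch of this accumulation in Python (names are mine; no overflow or cancellation handling shown):

```python
import math

def update(acc, x, y):
    """Fold one (x, y) pair into the running sums (n, sx, sy, sxx, syy, sxy)."""
    n, sx, sy, sxx, syy, sxy = acc
    return (n + 1, sx + x, sy + y, sxx + x * x, syy + y * y, sxy + x * y)

def correlation(acc):
    """Evaluate r from the accumulated sums whenever it is wanted."""
    n, sx, sy, sxx, syy, sxy = acc
    return ((sxy - sx * sy / n)
            / (math.sqrt(sxx - sx * sx / n) * math.sqrt(syy - sy * sy / n)))

# usage sketch:
# acc = (0, 0.0, 0.0, 0.0, 0.0, 0.0)
# for x, y in stream: acc = update(acc, x, y)
# r = correlation(acc)
```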

Multihunter