
I am creating a component that calculates the average and variance of a metric associated with events occurring over time, using only a limited internal memory.

Imagine that the events are visitors entering a shop and the metric is their age.

Over time, my component receives events carrying the age of each visitor. I don't want it to memorize the full history of ages. Ideally, I would like a lightweight component storing only three values: the average A, the variance V, and the number of events N.

After each event with age E, I want to update those three values:

N<=N+1
A<=(A*N+E)/(N+1)
V<=???

What about V? I am thinking of something like:

V<=(V*N+(E-A)^2)/(N+1)

I know it is not exact, as my previous V was computed with the old A, which is no longer the average.

Q1 - Is there an exact formula?
Q2 - If not, is my proposal a good estimate? Is it biased? Will it converge correctly when N increases?
Q3 - Is there a better formula?

Arnaud Mégret

3 Answers


A nice and simple algorithm for computing the variance in an online manner was described by Welford (1962). Below you can see a C++/Rcpp implementation of it that works offline, but can be easily adapted to the online scenario:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List welford_cpp(NumericVector x) {

  int n = x.length();
  double delta;
  double msq = 0;        // running sum of squared deviations from the mean
  double mean = x[0];    // running mean, seeded with the first value

  if (n > 1) {
    for (int i = 1; i < n; i++) {
      delta = x[i] - mean;            // deviation from the previous mean
      mean += delta / (i+1);          // update the running mean
      msq += delta * (x[i] - mean);   // mixes old and new mean: numerically stable
    }
    return Rcpp::List::create(Rcpp::Named("mean") = mean,
                              Rcpp::Named("variance") = msq / (n-1));
  }

  // Sample variance is undefined for fewer than two observations.
  return Rcpp::List::create(Rcpp::Named("mean") = mean,
                            Rcpp::Named("variance") = NAN);
}

As you can see, it needs to store only four variables, n, delta, msq, and mean (of which only three carry state between updates; delta is a temporary), and it computes the mean and variance simultaneously, as you wanted.


Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.

Tim

The variance can be expressed as proportional to the sum of squared differences between every value and the mean, or (as documented in many threads here on stats.SE, like this answer I wrote to another question) it can alternatively be expressed as proportional to the sum of squared pairwise differences between samples.

So we know:

$$\text {Var}(x) = \frac{1}{n} \cdot \sum_{i}(X_i-\overline{X})^2 = \frac{1}{2n^2} \cdot \sum_{i,j}(X_i-X_j)^2$$

Let's say you add another sample, indexed as the last index, $k$. Your previous variance would be:

$$\text {Var}_{old}(x) = \frac{1}{2(n-1)^2} \cdot \sum_{i<k,j<k}(X_i-X_j)^2$$

Your new variance is

$$\text {Var}_{new}(x) = \frac{1}{2n^2} \cdot \sum_{i,j}(X_i-X_j)^2 = \frac{1}{2n^2} \cdot \left( \sum_{i<k,j<k}(X_i-X_j)^2 + \sum_{j<k}(X_k-X_j)^2 + \sum_{i<k}(X_i-X_k)^2 \right)$$

But

$$\sum_{j<k}(X_k-X_j)^2 = \sum_{i<k}(X_i-X_k)^2\\ \sum_{i<k,j<k}(X_i-X_j)^2 = {2(n-1)^2} \cdot \text {Var}_{old}(x) $$

So

$$\text {Var}_{new}(x) = \left(\frac{n-1}{n}\right)^2 \text{Var}_{old}(x)+ \frac{1}{n^2} \sum_{j<k}(X_k-X_j)^2 $$

As @MarkL.Stone said in the comments, this still isn't efficient because we must keep every $X_i$. So, let's expand the formula to arrive at something more tractable.

$$\frac{1}{n^2}\sum_{j<k}(X_k-X_j)^2=\\ \frac{1}{n^2}\sum_{j<k}(X_k^2 - 2 \cdot X_j \cdot X_k + X_j^2)=\\ \frac{1}{n^2}\left(\sum_{j<k}X_k^2 - 2 \cdot X_k \cdot \sum_{j<k} X_j + \sum_{j<k} X_j^2\right)=\\ \frac{1}{n^2}\left((k-1)\cdot X_k^2 - 2 \cdot X_k \cdot (k-1) \cdot \overline{X_{old}} + (k-1) \cdot \overline{X^2_{old}}\right)\\ $$ Because the sum over $j<k$ has $k-1$ terms, and $$\sum_{j<k} X_j = (k-1) \cdot \overline{X_{old}}\\ \sum_{j<k} X_j^2 = (k-1) \cdot \overline{X^2_{old}}$$

The final form is then

$$\text {Var}_{new}(x) = \left(\frac{n-1}{n}\right)^2 \text{Var}_{old}(x)+ \frac{1}{n^2}\left((k-1)\cdot X_k^2 - 2 \cdot X_k \cdot (k-1) \cdot \overline{X_{old}} + (k-1) \cdot \overline{X^2_{old}}\right) $$

You can use this formula to update the variance memory-efficiently. It can also be extended to handle batches instead of single-point updates.

Basically, at every iteration you need to store the average, the average of the squared samples, and the variance, and use them in the update formula.


Further

$$\overline{X^2_{old}} = \text{Var}_{old}(x) + (\overline{X_{old}})^2\\ \therefore\text {Var}_{new}(x) = \left(\frac{n-1}{n}\right)^2 \text{Var}_{old}(x)+ \frac{1}{n^2}\left((k-1)\cdot X_k^2 - 2 \cdot X_k \cdot (k-1) \cdot \overline{X_{old}} + (k-1) \cdot \left(\text{Var}_{old}(x) + (\overline{X_{old}})^2\right)\right) $$

Which brings the number of stored quantities down to two (the mean and the variance), plus the running count $n$.

Firebug
  • Doesn't this method require the availability of all previous data points in order to compute the update? If so, that is in contravention to the idea of dealing with limited memory. Note that online updating algorithms, along the lines of Welford's in the answer by @Tim , which is a particular instance of a class of similar algorithms discussed in cs.yale.edu/publications/techreports/tr222.pdf , do not require saving old data points, but only 2 registers (scalar variables) to retain old information. – Mark L. Stone Sep 15 '16 at 19:46
  • @MarkL.Stone Hmm I see. Yes, this requires all the previous values $X_i$, you are right. – Firebug Sep 15 '16 at 19:49
  • @MarkL.Stone I updated the formula so three scalars need to be stored. I can already see it can be further reduced, perhaps being equivalent to the other solution. – Firebug Sep 15 '16 at 20:26
  • 1
    Due to the subtraction, rather than only adding non-negative quantities, your revised algorithm is less numerically accurate (robust) than the Welford and similar algorithms. I see no merit in it whatsoever. – Mark L. Stone Sep 15 '16 at 20:54

OK, Andy W gave the answer. By maintaining the average of $E^2$ in the same way as the average of $E$, you can use $V = exp(E^2)-exp(E)^2$.

Arnaud Mégret
  • 2
    By $exp(E^2)$ do you perhaps mean the *expected value* of $E^2$? (And not the exponential function.) – Andy W Sep 15 '16 at 13:46
  • 9
    That method is fine, unless you care about getting the right answer. – Mark L. Stone Sep 15 '16 at 13:53
  • Yes exp(ectation). And why won't the answer be correct ? Is it a mathematic problem or a computational one ? – Arnaud Mégret Sep 15 '16 at 16:32
  • 3
    Numerical instability, and therefore, numerical inaccuracy. It is correct if performed in exact arithmetic, i.e., infinite precision. In finite precision on a computer, it can be very inaccurate, and can even come out negative (and actually has on many occasions).. – Mark L. Stone Sep 15 '16 at 16:44
  • Well, I trust you. But, I am inclined to make the following simple reasoning : the relative error on an average is identical or less than the error on the value. So, the difference between 2 averages should not be really inaccurate unless close to 0. – Arnaud Mégret Sep 15 '16 at 16:51
  • 4
    Excel actually used this method for a long time (to much criticism and derision from statisticians and others). In quite simple circumstances (data with large mean, small standard deviation) you could make its variance function give output that corresponded to an approximation of a random number generator (shift the data by successive small amounts and the reported variance jumped around dramatically). This was caused by catastrophic cancellation of the difference. It was a very effective way to demonstrate why these issues matter. Excel doesn't do that any more. – Glen_b Sep 15 '16 at 19:07
  • 4
    On this catastrophic cancellation, see for example the discussion [here](https://en.wikipedia.org/wiki/Variance#Formulae_for_the_variance) – Glen_b Sep 15 '16 at 19:12