
I'm trying to re-implement the Ckmeans.1D.DP algorithm in Python. I have the actual dynamic programming part down, but I'm a little confused by the BIC computation they use for selecting the number of clusters. I'm not concerned right now with whether this is the best technique; I'm just trying to achieve parity with their implementation.

In particular, this variance computation (performed inside each cluster) is baffling:

```cpp
double mean = 0.0;
double variance = 0.0;

// One pass over the cluster [indexLeft, indexRight]: `mean` accumulates the
// sum, and `variance` accumulates the raw sum of squares (both variables are
// re-purposed below).
for (size_t i = indexLeft; i <= indexRight; ++i) {
  mean += x[i];
  variance += x[i] * x[i];
}
mean /= numPointsInBin;

// Convert the raw sum of squares into the sample variance (n - 1 denominator).
if (numPointsInBin > 1) {
  variance = (variance - numPointsInBin * mean * mean) / (numPointsInBin - 1);
} else {
  variance = 0;
}
```

(source)
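For reference, here is my rough Python port of that snippet (the function name and signature are mine, not from the package):

```python
def bin_mean_variance(x, index_left, index_right):
    """Mean and sample variance of x[index_left..index_right] (inclusive),
    mirroring the C++ snippet above."""
    num_points_in_bin = index_right - index_left + 1

    mean = 0.0
    variance = 0.0  # first holds the raw sum of squares, as in the C++ code
    for i in range(index_left, index_right + 1):
        mean += x[i]
        variance += x[i] * x[i]
    mean /= num_points_in_bin

    if num_points_in_bin > 1:
        variance = (variance - num_points_in_bin * mean * mean) / (num_points_in_bin - 1)
    else:
        variance = 0.0
    return mean, variance
```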

I read this as follows: $$\begin{align} \hat \mu_k &= \frac{1}{n_k} \sum_{i=i_\text{left}}^{i_\text{right}} x_i \\ \hat \sigma^2_k &= \begin{cases} \frac{1}{n_k - 1} \sum_{i=i_\text{left}}^{i_\text{right}} \left( x_i^2 - \hat \mu_k^2 \right) &,\ n_k > 1 \\ 0 &,\ n_k = 1 \end{cases} \end{align}$$

where $\hat \mu_k$ is a cluster mean, $\hat \sigma^2_k$ is a cluster variance, $n_k$ is a cluster size, and $k$ indexes the clusters.

This is totally bizarre: the maximum likelihood estimator for variance is $\frac{1}{n - 1} \sum_i \left(x_i - \hat \mu\right)^2$ (for generic $n$, $i$, and $\hat \mu$), and it's pretty clear that $\left(x_i - \hat \mu\right)^2 = x_i^2 - 2 x_i \hat\mu + \hat\mu^2 \neq x_i^2 - \hat\mu^2$. What gives?

shadowtalker
  • This is just the standard garden-variety textbook one-pass formula for computing sample variance with a denominator of (n-1), with the meaning of the variance variable changing along the way to make it a little more confusing. Note that the MLE estimator actually has n, not n-1, in the denominator. On your "pretty clear" claim in your penultimate sentence: if you take expectations, you'll see it comes out OK. This is a numerically unstable (potentially inaccurate) way of calculating sample variance, but it would be fine if all calculations were done in exact arithmetic, i.e. with no roundoff error. – Mark L. Stone May 02 '16 at 17:08
  • Although this result is difficult to [search for](http://stats.stackexchange.com/search?q=sum+zero+squares+regression), it has been shown in dozens of threads. The first relevant hit turned up by that search is http://stats.stackexchange.com/questions/97746, where a generalization (to regression involving more than just a constant, as here) is proven in several ways. But since you don't recognize the formula for $\hat\sigma^2_k$ as the standard definition of a sample variance, exactly what do you understand the variance to be? – whuber May 02 '16 at 17:18
  • Given that this numerically unstable code is used in an archival submission (CRAN), it is quite disconcerting, and contributes to the evidence that you can't automatically trust anything on CRAN. The package authors should learn basic numerical analysis; they would then realize how unsuitable their code is for production use. They could have used a numerically stable one-pass algorithm. ... Although at least this isn't being done in single precision on a GPU with billions of data points, as some folks do. – Mark L. Stone May 02 '16 at 17:20
  • @whuber I figured it out, I just went out to lunch before I posted my answer :) – shadowtalker May 02 '16 at 17:31
  • Also @MarkL.Stone thanks for the comments about numerical instability. What's even more bizarre is that they _already compute the within-cluster sum of squares_ when they backtrack the solution out of the dynamic programming matrix, so as far as I can tell they could have just done `within_ss / (numPointsInBin - 1)` – shadowtalker May 02 '16 at 17:32
  • Your reading of the code appears to be incorrect. The code subtracts $n$ times the square of the mean *outside* the loop. Inside the loop it sums the $x_i$ and the $x_i^2$. This is the usual formula offered in many elementary textbooks--but it's not the one you wrote down. – whuber May 02 '16 at 18:23
  • @whuber you're right, that's not what the code says, but they're algebraically equivalent. I mistakenly assumed that I would need to pull $n\hat\mu^2$ inside the summation to understand the calculation. – shadowtalker May 02 '16 at 18:35
  • Algebraic equivalence is not the same as computational equivalence. The distinction is important, as @MarkL.Stone has pointed out. The formula you wrote is more stable numerically than the one actually used in the code. – whuber May 02 '16 at 18:37
  • @whuber that's a little surprising. But as I mentioned in the comment above, isn't this computation unnecessary anyway if I already know the sum of squared deviations from the mean within the cluster? I can just divide that by $n - 1$, right? – shadowtalker May 02 '16 at 18:53
  • Yes, that's correct. As far as the claims of stability go, it comes down to something I'm sure you know. If you wanted (say) to compute the variance of the data $1, 2, 3$, you could do so accurately. You also know it's the same as the variance of the data $1000000001, 1000000002, 1000000003$. However, the squares of all those numbers (as double-precision floats) are large, and subtracting equally large numbers from them can lose all the precision inherent in the format. It happens all the time when clustering geographic coordinates! – whuber May 02 '16 at 19:17

1 Answer


The answer (which I discovered in the process of writing the question) is to look at the forest, not the trees!

Expand the sum: $$\begin{align} \sum_i \left( x_i - \hat\mu \right)^2 &= \sum_i \left( x_i^2 - 2\hat\mu x_i + \hat\mu^2 \right) \\ &= \sum_i x_i^2 - 2\hat\mu\sum_i x_i + n\hat\mu^2 \end{align}$$

Then the trick is to realize that $\sum_i x_i = \frac{n}{n}\sum_i x_i = n\hat\mu$:

$$\begin{align} \sum_i \left( x_i^2 \right) - 2\hat\mu \cdot n\hat\mu + n\hat\mu^2 &= \sum_i \left( x_i^2 \right) - 2n\hat\mu^2 + n\hat\mu^2 \\ &= \sum_i \left( x_i^2 \right) - n\hat\mu^2 \\ &= \sum_i \left( x_i^2 - \hat\mu^2 \right) \end{align}$$

which is exactly the "baffling" numerator I encountered in the code.
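A quick numerical sanity check of the identity (a sketch of mine, on well-scaled data where the cancellation discussed in the comments doesn't bite):

```python
import random

x = [random.gauss(0.0, 1.0) for _ in range(1000)]
n = len(x)
mu = sum(x) / n

lhs = sum((v - mu) ** 2 for v in x)        # sum of squared deviations
rhs = sum(v * v for v in x) - n * mu * mu  # the "baffling" form from the code

print(abs(lhs - rhs))  # tiny: the two agree up to floating-point roundoff
```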

shadowtalker