Between-cluster variance in k-means - derivation using total variance

Question

Follow-up to this older post (have to make it a question since I can't post comments yet).

Specifically, could anyone kindly show how $\operatorname{Var}[\operatorname E[X\mid K]]$ (in the total variance method) is equivalent to $\sum_k n_k(\bar x_k - \bar x)^2$ (in the "more direct" method)?

When I try to do this from first principles (MIT course, bottom right slide of page 1), I end up with $\sum_k\frac{n_k}{n}(\bar x_k - \bar x)^2$, which is the same "error/typo" that the OP made... so there must be something I'm missing. Something about the "weight function"? But I can't see how the example in the slide is any different from this clustering case.

Thanks a lot in advance.

What you ended up with looks better to me. For example, if each $n_k=1$, you would want a $\frac1n$ term — Henry, Jun 03 '19 at 00:14
I guess then the question becomes, how is the example on the MIT slide different from k-means clustering setting? — Tim, Jun 03 '19 at 00:26
The question in your first link is talking about the "total sum of squares" of differences from the mean rather than the variance (the expected square of the difference from the mean). Hence the $\frac1n$ factor — Henry, Jun 03 '19 at 00:49

Tim · Answer 1 · 2019-06-03T00:57:01.027

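OK, got it... The confusion arose from two different definitions of "total variation": the k-means problem uses the sum of squares $\sum_{i=1}^n(x_i - \bar x)^2$, whereas the conventional definition, which the slide (and in fact all our previous training) uses, is the variance $\frac1n\sum_{i=1}^n(x_i - \bar x)^2$. The two differ by exactly a factor of $n$, so $\sum_k n_k(\bar x_k - \bar x)^2 = n\operatorname{Var}[\operatorname E[X\mid K]]$ when $K$ takes the value $k$ with probability $\frac{n_k}{n}$.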
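As a quick numerical sanity check, here is a minimal sketch in Python with NumPy. It assumes that $K$ follows the empirical distribution $P(K=k)=\frac{n_k}{n}$, so that $\operatorname E[X\mid K=k]=\bar x_k$. The data, cluster sizes, and centers are made up purely for illustration, and the labels are taken as given rather than produced by an actual k-means run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: three clusters of unequal size (illustrative values only).
sizes = [5, 8, 12]
centers = [0.0, 3.0, 7.0]
x = np.concatenate([rng.normal(c, 1.0, s) for c, s in zip(centers, sizes)])
labels = np.repeat(np.arange(len(sizes)), sizes)

n = x.size
n_k = np.array(sizes)
grand_mean = x.mean()
cluster_means = np.array([x[labels == k].mean() for k in range(len(sizes))])

# k-means convention: between-cluster sum of squares (no 1/n factor).
between_ss = np.sum(n_k * (cluster_means - grand_mean) ** 2)

# Probabilistic convention: Var[E[X|K]] with weights P(K=k) = n_k / n.
var_cond_mean = np.sum((n_k / n) * (cluster_means - grand_mean) ** 2)

# Total and within-cluster sums of squares, for the full decomposition.
total_ss = np.sum((x - grand_mean) ** 2)
within_ss = sum(
    np.sum((x[labels == k] - cluster_means[k]) ** 2) for k in range(len(sizes))
)

print(np.isclose(between_ss, n * var_cond_mean))       # the two differ by a factor n
print(np.isclose(total_ss, within_ss + between_ss))    # sum-of-squares decomposition
print(np.isclose(np.var(x), total_ss / n))             # np.var uses the 1/n convention
```

Dividing every sum of squares by $n$ turns the k-means decomposition into the law of total variance, which is the only difference between the two methods.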

edited Jun 03 '19 at 00:57

answered Jun 03 '19 at 00:51

Tim
