2

I'm trying to understand the gap statistic used for optimal choice of $k$ in k-means clustering. I'm trying to understand part of the explanation which includes this equality:

$D_k=\sum_{ij}\Vert x_i-x_j\Vert^2=2n\sum_{i}\Vert x_i-\mu\Vert^2$

Where $\sum_{ij}\Vert x_i-x_j\Vert^2$ is the sum of pairwise distances in a cluster and $\mu$ is the centroid for that cluster. I don't understand how that is derived.

I've had a look at this question (Link between variance and pairwise distances within a variable) and the answer makes sense to me.

Assuming our data is centered about the (known) mean, which it is, by definition of the centroid, then the equality holds:

$\sum_{ij}(X_i-X_j)^2=2\sum X_i^2$

But I don't see how $2\sum X_i^2 = 2n\sum_{i}\Vert x_i-\mu\Vert^2$

If $2\sum X_i^2 = 2\operatorname{Var}(X)$

Then wouldn't $2\sum X_i^2 = \frac{2}{n}\sum_i(x_i-\mu)^2$?

Daniel
  • 145
  • 9

1 Answers1

3

To show that $$\sum\limits_{i=1}^n\sum\limits_{j=1}^{n}\left\|x_i - x_j\right\|^2 =2n\sum\limits_{i=1}^{n}\left\|x_i- \mu\right\|^2,$$ firstly note that $$\mu = \frac{1}{n}\sum\limits _{j =1}^{n}x_j.$$ Hence $$\begin{align*} 2n\sum\limits_{i=1}^{n}\left\|x_i- \mu\right\|^2 &= 2n\sum\limits_{i=1}^{n}\left\|x_i- \frac{1}{n}\sum\limits _{j =1}^{n}x_j\right\|^2 \\ &= 2n \sum\limits_{i=1}^{n}\left( \left\|x_i\right\|^2 -2x_i^T\left(\frac{1}{n}\sum\limits _{j =1}^{n}x_j\right) + \left\| \frac{1}{n}\sum\limits _{j =1}^{n}x_j\right\|^2 \right) \quad (\text{using }\left\|a-b\right\|^2 = \left\|a\right\|^2 -2a^Tb + \left\|b\right\|^2) \\ &= 2n \sum\limits_{i=1}^{n}\left\|x_i\right\|^2 -4 \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j + 2n^2 \left\| \frac{1}{n}\sum\limits _{j =1}^{n}x_j\right\|^2 \\ &= 2n \sum\limits_{i=1}^{n}\left\|x_i\right\|^2 -4 \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j + 2\left\| \sum\limits _{j =1}^{n}x_j\right\|^2 \\ &= 2n \sum\limits_{i=1}^{n}\left\|x_i\right\|^2 -4 \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j + 2\sum\limits_{i=1}^{n}\sum\limits _{j=1}^{n}x_i^Tx_j \\ &= 2n \sum\limits_{i=1}^{n}\left\|x_i\right\|^2 -2\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j \\ &= n \sum\limits_{i=1}^{n}\left\|x_i\right\|^2 -2\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j + n\sum\limits_{j=1}^{n}\left\|x_j\right\|^2 \\ &= \sum\limits_{i=1}^{n}\sum\limits _{j=1}^{n}\left\|x_i\right\|^2 -2\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j + \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\left\|x_j\right\|^2 \\ &= \sum\limits_{i=1}^{n}\sum\limits _{j=1}^{n} \left(\left\|x_i\right\|^2 -2x_i^T x_j + \left\|x_j\right\|^2\right) \\ &= \sum\limits_{i=1}^n\sum\limits_{j=1}^{n}\left\|x_i - x_j\right\|^2, \end{align*}$$ as required.


Note that we made use of several algebra facts along the way, such as $$\left\|\sum\limits_{j=1}^n x_j\right\|^2 = \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n}x_i^T x_j,$$ and that summing something that does depend on one of the summation indices introduces a factor of $n$, e.g. $$\sum\limits_{i=1}^{n}\left\|\frac{1}{n}\sum\limits_{j=1}^n x_j\right\|^2 = n \left\|\frac{1}{n}\sum\limits_{j=1}^n x_j\right\|^2$$ and $$n\sum\limits_{j=1}^{n}\left\|x_j\right\|^2 = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\left\|x_j\right\|^2.$$