As far as I understand the original article (Ward, J. H. (1963). Hierarchical grouping to optimize an objective function), Ward proposed the following criterion for agglomerative clustering.
In each step of the algorithm we should find two clusters $C_i$ and $C_j$ such that after merging them into the new cluster $C_k = C_i \cup C_j$ the quantity
$$\mathrm{ESS}(C_k) - \mathrm{ESS}(C_i) - \mathrm{ESS}(C_j) = |C_k| \mathrm{Var}(C_k) - |C_i| \mathrm{Var}(C_i) - |C_j| \mathrm{Var}(C_j) \\= \sum_{m \in K_{\text{after}}}|C_m| \mathrm{Var}(C_m) - \sum_{n \in K_{\text{before}}}|C_n| \mathrm{Var}(C_n) \\= \text{"K-means objective after merging"} ~~-~~ \text{"K-means objective before merging"}$$
should be minimal.
(Here ESS stands for "error sum of squares": $\mathrm{ESS}(C_k) = \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$, where $\mu_k$ is the centroid of $C_k$; $K_{\text{after}}$ is the set of clusters after the merge and $K_{\text{before}}$ the set of clusters before it.)
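To make this reading concrete, here is a quick numerical sketch (toy 1-D data of my own choosing, purely illustrative) checking that the merge cost $\mathrm{ESS}(C_k) - \mathrm{ESS}(C_i) - \mathrm{ESS}(C_j)$ agrees with the well-known closed form $\frac{|C_i||C_j|}{|C_i|+|C_j|}\,\|\mu_i - \mu_j\|^2$:

```python
import numpy as np

def ess(cluster):
    """Error sum of squares: squared distances to the cluster centroid."""
    c = np.asarray(cluster, dtype=float)
    return np.sum((c - c.mean(axis=0)) ** 2)

# Two small 1-D clusters (arbitrary toy data)
ci = np.array([[0.0], [1.0], [2.0]])
cj = np.array([[10.0], [12.0]])
ck = np.vstack([ci, cj])  # the merged cluster C_k = C_i U C_j

# Merge cost under the ESS reading: ESS(C_k) - ESS(C_i) - ESS(C_j)
delta = ess(ck) - ess(ci) - ess(cj)

# Closed form: |C_i||C_j| / (|C_i| + |C_j|) * ||mu_i - mu_j||^2
ni, nj = len(ci), len(cj)
closed = ni * nj / (ni + nj) * np.sum((ci.mean(axis=0) - cj.mean(axis=0)) ** 2)

print(delta, closed)  # both evaluate to 120.0 on this toy data
```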
But Wikipedia (and some other sources) say:
> Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.
Also, the sklearn manual (for sklearn.cluster.AgglomerativeClustering) says the following about the 'ward' linkage:
> ‘ward’ minimizes the variance of the clusters being merged.
This means that in each step of the algorithm we should find two clusters $C_i$ and $C_j$ such that after merging them into the new cluster $C_k = C_i \cup C_j$ the quantity
$$\mathrm{Var}(C_k) - \mathrm{Var}(C_i) - \mathrm{Var}(C_j) = \sum_{m \in K_{\text{after}}} \mathrm{Var}(C_m) - \sum_{n \in K_{\text{before}}}\mathrm{Var}(C_n)$$ should be minimal.
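For what it's worth, the two readings do not seem to be equivalent in general. Here is a small sketch (toy 1-D clusters I constructed specifically so that the two criteria disagree) that brute-forces the best merge under each reading:

```python
import numpy as np
from itertools import combinations

# Toy clusters: P and Q overlap heavily (large variance, close means),
# R is a nearby singleton. Chosen so the two criteria pick different pairs.
clusters = {
    "P": np.array([0.0, 4.0]),
    "Q": np.array([1.0, 5.0]),
    "R": np.array([2.9]),
    "S": np.array([3.2]),
}

def var(c):
    return np.var(c)  # population variance, matching Var in the formulas

def ess_cost(a, b):
    """Weighted reading: ESS(C_k) - ESS(C_i) - ESS(C_j), with ESS = |C| Var(C)."""
    m = np.concatenate([a, b])
    return len(m) * var(m) - len(a) * var(a) - len(b) * var(b)

def var_cost(a, b):
    """Unweighted reading: Var(C_k) - Var(C_i) - Var(C_j)."""
    return var(np.concatenate([a, b])) - var(a) - var(b)

def best_merge(cost):
    """Pair of cluster labels minimizing the given merge cost."""
    return min(combinations(clusters, 2),
               key=lambda p: cost(clusters[p[0]], clusters[p[1]]))

print(best_merge(ess_cost))  # ('Q', 'R') under the weighted (ESS) reading
print(best_merge(var_cost))  # ('P', 'Q') under the unweighted variance reading
```

Note that under the unweighted reading the merge cost of P and Q is even negative (merging two wide, overlapping clusters reduces the sum of per-cluster variances), while under the ESS reading every merge cost is nonnegative.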
The Wikipedia/sklearn interpretation differs from mine exactly by the cluster-size weights $|C_m|$: my criterion sums size-weighted variances (i.e. ESS terms), while theirs sums the unweighted variances. So I want to know which interpretation of Ward's agglomerative clustering is correct — mine or the one from Wikipedia/sklearn.