As far as I understand the original article (Ward, J. H. (1963). Hierarchical grouping to optimize an objective function), Ward proposed the following criterion for agglomerative clustering.
In each step of the algorithm we should find two clusters $C_i$ and $C_j$ such that after merging them into the new cluster $C_k = C_i \cup C_j$ the quantity
$$\mathrm{ESS}(C_k) - \mathrm{ESS}(C_i) - \mathrm{ESS}(C_j) = |C_k| \mathrm{Var}(C_k) - |C_i| \mathrm{Var}(C_i) - |C_j| \mathrm{Var}(C_j) \\= \sum_{m \in K_{\text{after}}}|C_m| \mathrm{Var}(C_m) - \sum_{n \in K_{\text{before}}}|C_n| \mathrm{Var}(C_n) \\= \text{"K-means objective after merging"} ~~-~~ \text{"K-means objective before merging"}$$
should be minimal.
(Here ESS stands for "error sum of squares": $\mathrm{ESS}(C_k) = \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$, where $\mu_k$ is the centroid of $C_k$; $K_{\text{after}}$ is the set of clusters after the merge and $K_{\text{before}}$ the set of clusters before it.)
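To make this reading concrete, here is a quick numerical sketch (toy 1-D data of my own choosing, purely illustrative) checking that the merge cost $\mathrm{ESS}(C_k) - \mathrm{ESS}(C_i) - \mathrm{ESS}(C_j)$ agrees with the well-known closed form $\frac{|C_i||C_j|}{|C_i|+|C_j|}\,\|\mu_i - \mu_j\|^2$:

```python
import numpy as np

def ess(cluster):
    """Error sum of squares: squared distances to the cluster centroid."""
    c = np.asarray(cluster, dtype=float)
    return np.sum((c - c.mean(axis=0)) ** 2)

# Two small 1-D clusters (arbitrary toy data)
ci = np.array([[0.0], [1.0], [2.0]])
cj = np.array([[10.0], [12.0]])
ck = np.vstack([ci, cj])  # the merged cluster C_k = C_i U C_j

# Merge cost under the ESS reading: ESS(C_k) - ESS(C_i) - ESS(C_j)
delta = ess(ck) - ess(ci) - ess(cj)

# Closed form: |C_i||C_j| / (|C_i| + |C_j|) * ||mu_i - mu_j||^2
ni, nj = len(ci), len(cj)
closed = ni * nj / (ni + nj) * np.sum((ci.mean(axis=0) - cj.mean(axis=0)) ** 2)

print(delta, closed)  # both evaluate to 120.0 on this toy data
```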
But Wikipedia (and some other sources) say:
> Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to minimum increase in total within-cluster variance after merging.
Also, the sklearn manual (for sklearn.cluster.AgglomerativeClustering) says the following about the 'ward' linkage:
> ‘ward’ minimizes the variance of the clusters being merged.
This means that in each step of the algorithm we should find two clusters $C_i$ and $C_j$ such that after merging them into the new cluster $C_k = C_i \cup C_j$ the quantity
$$\mathrm{Var}(C_k) - \mathrm{Var}(C_i) - \mathrm{Var}(C_j) = \sum_{m \in K_{\text{after}}} \mathrm{Var}(C_m) - \sum_{n \in K_{\text{before}}}\mathrm{Var}(C_n)$$ should be minimal.
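For what it's worth, the two readings do not seem to be equivalent in general. Here is a small sketch (toy 1-D clusters I constructed specifically so that the two criteria disagree) that brute-forces the best merge under each reading:

```python
import numpy as np
from itertools import combinations

# Toy clusters: P and Q overlap heavily (large variance, close means),
# R is a nearby singleton. Chosen so the two criteria pick different pairs.
clusters = {
    "P": np.array([0.0, 4.0]),
    "Q": np.array([1.0, 5.0]),
    "R": np.array([2.9]),
    "S": np.array([3.2]),
}

def var(c):
    return np.var(c)  # population variance, matching Var in the formulas

def ess_cost(a, b):
    """Weighted reading: ESS(C_k) - ESS(C_i) - ESS(C_j), with ESS = |C| Var(C)."""
    m = np.concatenate([a, b])
    return len(m) * var(m) - len(a) * var(a) - len(b) * var(b)

def var_cost(a, b):
    """Unweighted reading: Var(C_k) - Var(C_i) - Var(C_j)."""
    return var(np.concatenate([a, b])) - var(a) - var(b)

def best_merge(cost):
    """Pair of cluster labels minimizing the given merge cost."""
    return min(combinations(clusters, 2),
               key=lambda p: cost(clusters[p[0]], clusters[p[1]]))

print(best_merge(ess_cost))  # ('Q', 'R') under the weighted (ESS) reading
print(best_merge(var_cost))  # ('P', 'Q') under the unweighted variance reading
```

Note that under the unweighted reading the merge cost of P and Q is even negative (merging two wide, overlapping clusters reduces the sum of per-cluster variances), while under the ESS reading every merge cost is nonnegative.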
The Wikipedia/sklearn interpretation differs from mine exactly by the cluster-size weights $|C_m|$: my criterion sums size-weighted variances (i.e. ESS terms), while theirs sums the unweighted variances. So I want to know which interpretation of Ward's agglomerative clustering is correct — mine or the one from Wikipedia/sklearn.