Which unsupervised classification method to use next if hierarchical clustering gave bad results?

Question

Purposes

I need to classify weather stations by the intra-annual variability of two climate indicators. There are 613 sites with monthly averaged data (12 values per indicator), so the input array has shape 613 x 24.

The seasonal distribution of each indicator, plotted separately, is shown below: [image: seasonal distribution of each indicator]

Usually it is possible to find dense groups of weather stations with similar intra-annual variations of the indicators. Since there are many classification (partitioning) methods, the most suitable one has to be identified and justified.

Methodology

Classification method: hierarchical clustering seems to be the default solution for this case. I chose the Ward and complete linkage methods; the latter was used with about 20 different metrics.

Number of classes: $1 \leq k \leq 50$

Estimation method: the quality of a classification was estimated by combining the functions $I_{r}$ (a measure of average intraclass dispersion) and $z_{r}$ (a measure of average element concentration):

$$F(k)=I_{r}\cdot\frac{1}{z_{r}}\rightarrow \min$$

$$I_{r}^{(k)}=\left [ \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_{i}}\sum_{x_{p},x_{q}\in S_{i}}d_{pq}^{r}\right ]^{\frac{1}{r}}$$

$$z_{r}^{(k)}=\left [ \frac{1}{k}\sum_{i=1}^{k} \left ( \frac{n_{i}}{k}\right )^{r} \right ]^{\frac{1}{r}}$$

where $k$ is the selected number of classes, $n_{i}$ is the number of elements in class $i$, $r$ is the exponent (equal to 2 in my case), $S_{i}$ is class $i$ of the given partition, and $\sum_{x_{p},x_{q}\in S_{i}}d_{pq}^{r}$ is the sum of Euclidean distances (raised to the power $r$) between the elements of that class (from Mandel I. D. Klasternyi analiz [Cluster Analysis]. Moscow: Finansy i statistika, 1988. 176 p.).

$I_{r}$ characterizes the intraclass variation and is a decreasing function of $k$. $\frac{1}{z_{r}}$ counteracts the tendency toward excessive subdivision and increases with $k$. These opposite trends are balanced in the quality functional $F(k)$, so the idea is to find the value of $k$ corresponding to the global minimum of $F(k)$.
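For anyone who wants to reproduce the curves, here is a minimal Python sketch of the functional. It assumes that the inner sum runs over unordered pairs $p < q$ within a class (the formula leaves this open) and that `labels` holds one class label per station. The function name `quality_functional` and the toy data are hypothetical.

```python
import numpy as np


def quality_functional(X, labels, r=2):
    """Return (F, I_r, z_r) for one partition, following the formulas above.

    Assumption: the inner sum counts each unordered pair (p < q) once.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = len(classes)

    dispersion_terms = []
    concentration_terms = []
    for c in classes:
        members = X[labels == c]
        n_i = len(members)
        # Pairwise Euclidean distances inside the class.
        diff = members[:, None, :] - members[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        upper = d[np.triu_indices(n_i, k=1)]  # each unordered pair once
        dispersion_terms.append((upper ** r).sum() / n_i)
        concentration_terms.append((n_i / k) ** r)

    I_r = np.mean(dispersion_terms) ** (1.0 / r)
    z_r = np.mean(concentration_terms) ** (1.0 / r)
    return I_r / z_r, I_r, z_r


# Toy check on four points split into two classes (not the station data).
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]])
F, I_r, z_r = quality_functional(X_toy, [0, 0, 1, 1])
print(F, I_r, z_r)
```

Evaluating this for the labels obtained at each $k$ gives one point of the $F(k)$ curve for a given method and metric.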

Results

Consider the $F(k)$ curves for different methods and metrics: [image: $F(k)$ curves for different methods and metrics]

It seems that:

Ward's method unexpectedly gave the worst results;
The global minimum ($k = 5$, complete linkage, Chebyshev distance) is very close to the minimum ($k = 1$) of most of the $F(k)$ curves. Does this mean it is worth leaving the dataset as it is, with no subclasses distinguished?
The partition corresponding to the global minimum is degenerate: it yields one huge class with more than 99% of the members and four one-element classes.

Question

This short study suggests that there are no subclasses in this dataset. But perhaps this case simply lies outside the scope of hierarchical clustering methods. There should be other unsupervised learning techniques that can confirm or refute this hypothesis.

Please share your experience and feel free to try this dataset on your own. Any help will be appreciated (especially Python solutions).

`where N is selected amount of classes` That's unclear. Could N be the total sample size and k be the number of classes? Next: the `i` subscript indicates a class (cluster), am I right? Then what is $d_{ij}$: the distance between a point and a class? I see problems with your notation. Also, what is the meaning of the $z_{r}$ measure? – ttnphns, Jul 22 '15 at 02:47
@ttnphns, thanks, that's a good point, I've updated the post. – Vitaly Isaev, Jul 22 '15 at 06:28
@Anony-Mousse, the data were standardized before clustering, what else can I do? – Vitaly Isaev, Jul 22 '15 at 06:28
There are at least three different ways you can standardize, for example. Also there seem to be missing values that can cause problems with distances. – Has QUIT--Anony-Mousse, Jul 22 '15 at 06:31
@ttnphns, please post an answer with your research, I would like to upvote it at least. – Vitaly Isaev, Jul 29 '15 at 21:01

Digio · Answer 1 · 2015-07-25T12:23:30.590

5

I loaded your data into R and applied hierarchical clustering with Ward's method, which gave three clean-cut clusters for your stations (Fig.1). Then I applied Principal Component Analysis to the scaled data, which revealed that 71% of the variance is explained by the first two components (Fig.2). A biplot of the first two components shows how months (direction of the vectors in the plot) correlate with stations (scattered points) and with each other (Fig.3). I created a response vector based on the dendrogram of Fig.1 in order to see whether the output of hierarchical clustering is reflected in the principal component scatter plot of Fig.3. It turns out that the three clusters appear next to each other (Fig.4). I repeated the whole process using complete linkage, and it gave me two main clusters instead of three (final result in Fig.5).

These are pretty good results and it doesn't look like you need another method. I guess this is not in agreement with what you found but I hope it helps.
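For a Python counterpart of the workflow described above, a rough sketch with NumPy and SciPy might look like this. It is not the R code used for the figures: the synthetic `X_demo` matrix only stands in for the real 613 x 24 data, column-wise z-scores are one assumed choice of scaling, and `ward_labels_and_pca` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def ward_labels_and_pca(X, n_clusters=3):
    """Standardize columns, cut a Ward tree, and project onto PC1/PC2."""
    X = np.asarray(X, dtype=float)
    # Column-wise z-scores (an assumed choice; other scalings are possible).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Ward linkage on Euclidean distances, cut into n_clusters groups.
    tree = linkage(Z, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    # PCA through the SVD of the standardized (hence centered) matrix.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = U[:, :2] * s[:2]  # station coordinates on PC1 and PC2
    return labels, scores, explained


# Synthetic stand-in for the station matrix (NOT the real data).
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(loc=m, scale=1.0, size=(50, 24)) for m in (0.0, 4.0, 8.0)]
)
labels, scores, explained = ward_labels_and_pca(X_demo, n_clusters=3)
print("cluster sizes:", np.bincount(labels)[1:])
print("variance explained by PC1+PC2:", explained[:2].sum())
```

Colouring a scatter plot of `scores` by `labels` gives the same kind of picture as Fig.4: the clusters come from the tree, and PCA is only used to look at them.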

Fig.1

Fig.2

Fig.3

Fig.4

Fig.5

edited Jul 25 '15 at 12:23

answered Jul 24 '15 at 12:22


What was the coefficient implied on the vertical axis of the Ward's method dendrogram shown in your analysis? Ward's dendrograms may look [misleadingly pleasant](http://stats.stackexchange.com/a/63549/3277). Other criteria besides the look of the dendrogram might suggest a number of clusters other than 3. – ttnphns Jul 24 '15 at 18:07
I've done Ward clustering of these data (on all 24 variables, 2x12 months, not on principal components), and I can say that when the data were not standardized there was some, though weak, support for the 3-cluster solution. However, since the first 12 variables differ greatly in variance from the second 12, one has to standardize the variables first. After that, Ward clustering produced solutions none of which was superior to the others (I used the Calinski-Harabasz and Davies-Bouldin criteria). I would conclude that the data have no cluster structure, at least by Ward's method. – ttnphns Jul 24 '15 at 18:54
Yes, the dendrogram can look misleadingly pleasant; that's the whole point of combining it with PCA. The three clusters appear right next to each other in a plot which explains 71% of the variance; this cannot be random. Complete-linkage clustering would give 2 main clusters instead of three (added in Fig.5). In both cases, there's definitely a pattern. – Digio Jul 25 '15 at 12:18
@Digio, thanks for your try. PCA is a good solution, but to me it looks like the `PC1` vs `PC2` plot shows a single unified cloud of points with several outlets. Perhaps a third dimension could distinguish the clusters. – Vitaly Isaev Jul 26 '15 at 08:08
Vitaly, you're welcome, it's been fun. I'm not sure it's well understood, though, that PCA is not used for clustering here. The colours you see in the cloud of the PC1 vs PC2 scatter plot did not come from PCA but from HC. The fact that the coloured points don't follow a random pattern means that the HC results are also reflected across PC1. And this tells you that you have good reason to trust HC for this dataset. – Digio Jul 26 '15 at 09:11
Strangely, Ward's method in Python gives very different results. I will next try clustering using a SOM (Self Organising Map). Just to make sure I got it right the first time: your objective is to cluster the 24 monthly observations as they appear in the 613x24 matrix in your csv? Your seasonal distribution plots are in Russian and I can't make out the details between them. Are they supposed to be samples over two sequential years or what? In which case, why would you merge them horizontally? – Digio Jul 27 '15 at 13:02
Why not use PCA and hierarchical clustering in sequence (HCPC) with the excellent [FactoMineR](http://factominer.free.fr/classical-methods/hierarchical-clustering-on-principal-components.html) package? That tends to give more definite and stable clusters. – Antoine Jul 27 '15 at 15:13
@Digio, the hydrometeorological regime of each site is characterized by the intra-annual variation of two indicators. For each of them we have just monthly averages: `12` for the first one and `12` for the other one. We need to classify `613` sites taking into account the seasonal variation of both indicators. – Vitaly Isaev Jul 27 '15 at 19:10
That's what I had understood. It appears, though, that Python's Ward does not produce the same results as R's Ward. – Digio Jul 29 '15 at 13:13
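The Calinski-Harabasz criterion mentioned in the comments above can be computed directly from a data matrix and a label vector, which makes it easy to compare partitions from R and Python on equal terms. The following is a small sketch of the standard index (between-cluster over within-cluster dispersion, each divided by its degrees of freedom). The function name and the toy data are hypothetical, and this is not the code used by any commenter.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: larger values mean better-separated clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    classes = np.unique(labels)
    k = len(classes)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n clusters")

    overall = X.mean(axis=0)
    between = 0.0  # weighted squared distances of centroids to the grand mean
    within = 0.0   # squared distances of points to their own centroid
    for c in classes:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Toy data (not the station data): two tight pairs on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
print(calinski_harabasz(X_toy, [0, 0, 1, 1]))  # compact, well-separated split
print(calinski_harabasz(X_toy, [0, 1, 0, 1]))  # interleaved split scores far lower
```

If the index shows no clear peak over a range of cluster counts, that agrees with the reading that no solution is superior.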

theforestecologist · Answer 2 · 2018-05-24T14:07:59.893

You could try non-hierarchical partitioning.

If you're willing to work in R, you could try the package optpart. It provides a variation of k-means, pam (see Kaufman & Rousseeuw, 1990), as well as two methods of non-hierarchical partitioning, OPTPART and OPTSIL. http://cran.r-project.org/web/packages/optpart/optpart.pdf

Functions silhouette and stride additionally help you decide on the proper number of partitions. These methods can potentially be used (with some modification) for hierarchical clustering methods as well.

A final method to look into is ISODATA (though I admittedly know little about it).

I don't think Dave Roberts has extended OPTPART in any way to Python, but I'm sure some version of pam can be used in Python with little to no effort. OPTPART usually creates the best partitions for me, so it might be worth looking into. In the end, I'm not sure these methods will necessarily serve you any better but, at the very least, they'll provide you with a handful of additional (often overlooked) clustering methods to try.
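As a rough idea of what a partitioning approach could look like in Python, here is a NumPy-only sketch. It uses a simple alternating k-medoids scheme (assign points, then recompute medoids), which is a simplification and not the PAM build/swap search of Kaufman & Rousseeuw, and it is not OPTPART or OPTSIL. The mean silhouette width is used to compare numbers of partitions. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def pairwise_euclidean(X):
    """Full Euclidean distance matrix between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative round-off


def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a distance matrix D (not the PAM swap search)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) == 0:
                continue  # keep the old medoid if a cluster empties
            # New medoid: member with the smallest total distance to the others.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids


def mean_silhouette(D, labels):
    """Mean silhouette width; singleton clusters contribute 0 by convention."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if len(classes) < 2:
        raise ValueError("silhouette needs at least two clusters")
    s = np.zeros(D.shape[0])
    for i in range(D.shape[0]):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue
        a = D[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        s[i] = 0.0 if denom == 0 else (b - a) / denom
    return s.mean()


# Synthetic two-blob data (NOT the station data), scanned over a few k.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, (40, 24)), rng.normal(5.0, 1.0, (40, 24))])
D_demo = pairwise_euclidean(X_demo)
for k in range(2, 5):
    labels_k, _ = k_medoids(D_demo, k)
    print(k, mean_silhouette(D_demo, labels_k))
```

Because the result depends on the random starting medoids, it is worth rerunning with several `seed` values and keeping the partition with the best silhouette.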

Hope this helps!
