Strategy to determine the number of groupings in k-means

Question

Can you offer advice on a clustering strategy?

I have 4 continuous variables and I would like to perform cluster analysis.

The correlation matrix of the variables is

         Var1     Var2     Var3     Var4
Var1     1.00     0.32    -0.46     0.24
Var2     0.40     1.00    -0.29     0.00
Var3    -0.46    -0.40     1.00     0.36
Var4     0.27     0.01     0.30     1.00

No variables are highly correlated so I include all 4.

First, I cluster the raw data as follows:

(1) center and scale the variables; (2) run several clustering techniques (k-means, PAM, and agglomerative hierarchical) with various values of K; (3) examine the average silhouette width.
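For concreteness, steps (1) and (3) can be sketched as below. This is a minimal standard-library Python sketch, not the R code actually used: the helper names `standardize` and `average_silhouette_width` and the toy data are hypothetical, and Euclidean distance is assumed. In R, `scale()` and `cluster::silhouette()` would be the usual counterparts.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is taken to be 0
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical toy data: two variables, two visibly separated groups.
data = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0],
        [5.0, 30.0], [5.2, 31.0], [4.8, 29.0]]
scaled = standardize(data)

good = average_silhouette_width(scaled, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = average_silhouette_width(scaled, [0, 1, 0, 1, 0, 1])   # cuts across them
print(round(good, 2), round(bad, 2))
```

The labels here are supplied by hand. In the procedure above they would come from each clustering technique at each K, and the resulting averages are what the table below ranks.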

The top average silhouette widths for the different techniques are below.

K-means with 3 clusters has the highest at 0.32, followed by k-means with 5 clusters at 0.31.

Technique       Number of clusters   Average silhouette width
k-means         3                    0.32
k-means         5                    0.31
agglomerative   2                    0.31
agglomerative   5                    0.30
k-means         6                    0.30

Based on this, I am thinking of using k-means with 3 clusters.

I also looked at performing PCA and then running k-means on the first 3 principal components. To do this I:

(1) center the original data; (2) use princomp in R and multiply the centered data by the loadings (data %*% princomp$loadings) to get scores; (3) apply the same procedure as described above, using various clustering techniques and various values of K.
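The score computation in steps (1) and (2) can be sketched as follows, in standard-library Python rather than R. Power iteration stands in here for the eigendecomposition a PCA routine performs; `center`, `covariance`, `principal_axes`, `project`, and the toy data are hypothetical names used only for illustration. The sketch also checks that keeping every component only rotates the data, so pairwise distances are unchanged, while keeping fewer components can only shrink them. In R, `princomp` already returns the scores in its `scores` component, so the manual multiplication is optional.

```python
import math

def center(rows):
    """Subtract each column's mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (divisor n - 1) of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def principal_axes(cov, k, iters=1000):
    """First k eigenvectors of cov, by power iteration with orthogonalization."""
    p = len(cov)
    axes = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(p)]  # arbitrary starting vector
        for _ in range(iters):
            v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            for a in axes:  # remove the directions already found
                dot = sum(x * y for x, y in zip(v, a))
                v = [x - dot * y for x, y in zip(v, a)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        axes.append(v)
    return axes

def project(centered, axes):
    """Scores: centered data multiplied by the loadings (one column per axis)."""
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
            for row in centered]

# Hypothetical toy data with three variables.
data = [[2.0, 1.0, 0.5], [3.0, 2.5, 0.0], [4.0, 2.0, 1.5],
        [5.0, 4.5, 1.0], [6.0, 4.0, 2.5], [7.0, 6.5, 2.0]]
centered = center(data)
axes = principal_axes(covariance(centered), 3)
scores_all = project(centered, axes)          # all components kept
scores_two = [row[:2] for row in scores_all]  # last component dropped

d_raw = math.dist(centered[0], centered[5])
d_all = math.dist(scores_all[0], scores_all[5])
d_two = math.dist(scores_two[0], scores_two[5])
print(d_raw, d_all, d_two)
```

Since a silhouette is built from these pairwise distances, one computed on a subset of the components is measured on a different geometry from one computed on the full set of variables.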

The top silhouettes are

Technique       Number of clusters   Average silhouette width
agglomerative   2                    0.44
k-means         2                    0.41
k-means         5                    0.33
agglomerative   5                    0.33
agglomerative   4                    0.33

You can see that with PCA the top two results (agglomerative at 0.44 and k-means at 0.41) are better than anything on the raw data, but they have only 2 clusters.

Questions:

What is a "good" average silhouette width? Are 0.33 and 0.44 low in practice? I have seen that < 0.5 means weak clustering. What do people see empirically?

Any advice you can give on which clustering (raw data vs. PCA) to use? Are there any other steps I should perform to work out the proper technique?

Any thoughts on ways to increase the silhouette avg width?

I know expert analysis of the clusters is necessary, but I am looking for thoughts.

I am not typically a fan of k-means, or other heuristic clustering approaches because defining "appropriate clusters" is always difficult and subjective. Looks like you're using R, so I recommend the `mclust` library. It uses Bayes factors for determining appropriate clusters and sizes. — Jon, Oct 19 '16 at 15:55
thanks. do you use the silhouette value? if so is .33 -.4 in the ballpark — user3022875, Oct 19 '16 at 16:25
@Jon just checked: the silhouette for the mclust classification is .2 on the raw data and .16 on the PCA data — user3022875, Oct 19 '16 at 17:22
1) Why is your corr matrix not symmetric? 2) Silhouette of .3-.4 is still not high. Value some .6-.7 or higher designate "quite good" clusters. 3) All your values are roughly the same; it is odd to prefer .32 to .30. It looks like you have quite a weak cluster structure in your data. 4) "PCA's" silhouette results shouldn't be compared directly with the "variables" results because these are not identical datasets - the number of variables is different: [see](http://stats.stackexchange.com/q/222675/3277). — ttnphns, Oct 19 '16 at 18:31
Its entries are not symmetric (not equal) about its diagonal. var1-var2 = 0.32, but var2-var1= 0.4. How can that be? — ttnphns, Oct 19 '16 at 20:19
that is a typo from typing the numbers. i can update. any thoughts on a strategy? — user3022875, Oct 19 '16 at 20:39
Check: http://stats.stackexchange.com/questions/23472/how-to-decide-on-the-correct-number-of-clusters — Tim, Oct 20 '16 at 07:06
@ttnphns , do you have a reference for "2) Silhouette of .3-.4 is still not high. Value some .6-.7 or higher designate "quite good" clusters" ? thanks — André Costa, Aug 28 '18 at 17:47