Can you offer advice on a clustering strategy?
I have 4 continuous variables and I would like to perform cluster analysis.
The correlation matrix of the variables is:
        Var1    Var2    Var3    Var4
Var1    1.00    0.32   -0.46    0.24
Var2    0.40    1.00   -0.29    0.00
Var3   -0.46   -0.40    1.00    0.36
Var4    0.27    0.01    0.30    1.00
No variables are highly correlated so I include all 4.
First I perform clustering on the raw data by:
(1) center and scale the variables;
(2) run several clustering techniques (k-means, PAM, and agglomerative hierarchical) over a range of values of K;
(3) examine the average silhouette width.
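For reference, the three steps above can be sketched in R roughly as follows. This is a minimal sketch, not my exact script: the data are simulated stand-ins, and the range of K (2 to 6), the average linkage for the agglomerative method, and `nstart = 25` for k-means are all assumptions made for illustration.

```r
library(cluster)  # provides pam() and silhouette()

# Hypothetical stand-in for the real data: 200 rows, 4 continuous variables
set.seed(1)
dat <- matrix(rnorm(200 * 4), ncol = 4,
              dimnames = list(NULL, paste0("Var", 1:4)))

x  <- scale(dat)  # step 1: center and scale each variable
d  <- dist(x)     # Euclidean distances, reused for every silhouette
ks <- 2:6         # candidate numbers of clusters (assumed range)

# Average silhouette width for a vector of cluster labels
avg_sil <- function(labels) mean(silhouette(labels, d)[, "sil_width"])

# Agglomerative tree is built once and cut at each k (linkage is an assumption)
hc <- hclust(d, method = "average")

# steps 2 and 3: fit each method at each k and record the average width
results <- do.call(rbind, lapply(ks, function(k) {
  data.frame(
    k             = k,
    kmeans        = avg_sil(kmeans(x, centers = k, nstart = 25)$cluster),
    pam           = avg_sil(pam(x, k)$clustering),
    agglomerative = avg_sil(cutree(hc, k = k))
  )
}))
print(results)
```

Using one distance matrix for every method keeps the silhouette widths comparable across techniques.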
The top average silhouette widths across the different techniques are below.
K-means with 3 clusters is highest at 0.32, followed by k-means with 5 clusters at 0.31.
Technique        Number of clusters   Average silhouette width
kmeans           3                    0.32
kmeans           5                    0.31
agglomerative    2                    0.31
agglomerative    5                    0.30
kmeans           6                    0.30
Based on this, I am thinking of using k-means with 3 clusters.
I also looked at performing PCA and then running k-means on the first 3 principal components. To do this I:
(1) center the original data;
(2) use princomp in R and multiply the centered data by the loadings (data %*% princomp$loadings) to get the scores;
(3) apply the same procedure as described above, using various clustering techniques and various values of K.
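The projection in step (2) can be sketched as below. Again, this is a minimal sketch on simulated stand-in data, not my actual data; the object names (`dat`, `pc`, `scores3`) are hypothetical. It also checks that the manual multiplication matches the scores that `princomp` already stores.

```r
# Hypothetical stand-in for the real data: 200 rows, 4 continuous variables
set.seed(1)
dat <- matrix(rnorm(200 * 4), ncol = 4,
              dimnames = list(NULL, paste0("Var", 1:4)))

pc <- princomp(dat)  # covariance-based PCA (the default, cor = FALSE)

# step 1: center only, no scaling
centered <- scale(dat, center = TRUE, scale = FALSE)

# step 2: project the centered data onto the loadings
scores_manual <- centered %*% unclass(pc$loadings)

# princomp keeps the same projection in pc$scores
print(all.equal(unname(scores_manual), unname(pc$scores)))

# first three components, which are the input to the clustering in step 3
scores3 <- pc$scores[, 1:3]
```

Note that centering without scaling makes this a covariance-based PCA; `princomp(dat, cor = TRUE)` would be the counterpart to the centered-and-scaled run on the raw data.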
The top average silhouette widths are:
Technique        Number of clusters   Average silhouette width
agglomerative    2                    0.44
kmeans           2                    0.41
kmeans           5                    0.33
agglomerative    5                    0.33
agglomerative    4                    0.33
With PCA, the top two results (agglomerative at 0.44 and k-means at 0.41) are better than anything obtained on the raw data, but both have only 2 clusters.
Questions:
What is a "good" average silhouette width? Are 0.33 and 0.44 low in practice? I have seen that < 0.5 means weak clustering. What do people see empirically?
Any advice on which clustering to use (raw data vs. PCA)? Are there other steps I should perform to settle on the proper technique?
Any thoughts on ways to increase the average silhouette width?
I know expert analysis of the clusters is necessary, but I am looking for general thoughts.
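For clarity, the quantity I am comparing throughout is the standard silhouette width. For observation i, let a(i) be its mean distance to the other members of its own cluster and b(i) the smallest mean distance to the members of any other cluster. Then:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i),
\qquad
-1 \le s(i) \le 1
```

The average width \bar{s} is the number reported in the tables above; values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones.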