Is it OK to use correlated variables for cluster analysis?

Question

I know there is a series of regression diagnostics procedures (correlation, beta, residuals, etc.) to run before, during, and after regression analysis. But is there any common procedure to follow for cluster analysis (for example, with Ward's method)? What are the R commands? Thanks!

score 3 · Answer 1 · answered Aug 26 '13 at 07:35

Correlation can cause problems with many clustering algorithms by giving extra weight to the correlated attributes. For k-means, for example, it seems to be best practice to whiten the data first.
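As a rough illustration of what whitening does, here is a minimal NumPy sketch of PCA whitening. The function name `pca_whiten`, the `eps` guard, and the synthetic data are assumptions made for this example and are not part of the original answer; it is one possible way to whiten, not the only one.

```python
import numpy as np

def pca_whiten(X, eps=1e-12):
    """PCA-whiten X (rows = observations, columns = variables).

    Hypothetical helper for illustration: centers the data, rotates it
    onto the eigenvectors of the covariance matrix, and rescales each
    component to unit variance.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    # eps guards against division by a (near-)zero eigenvalue
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

# Synthetic data (an assumption): columns 0 and 1 are strongly correlated,
# column 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

r_before = np.corrcoef(X[:, 0], X[:, 1])[0, 1]  # correlation before whitening
Z = pca_whiten(X)
cov_Z = np.cov(Z, rowvar=False)                 # close to the identity matrix
```

After whitening, the sample covariance of `Z` is approximately the identity, so no direction gets extra weight in a Euclidean distance. Whether that is desirable depends on the data: whitening also rescales low-variance directions, which can amplify noise.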

However, there exist correlation clustering algorithms that are meant to process data containing multiple correlations, and cluster objects based on the correlations they exhibit.

Has QUIT--Anony-Mousse

A data set with no *global* correlations (i.e. $Cov(X,Y) = 0$ over the whole data set) may still contain clusters, and may even contain correlated subsets. That is the point of doing correlation clustering: finding subsets that have *different* correlations. – Has QUIT--Anony-Mousse Nov 24 '13 at 12:43
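A small constructed example may make the point in this comment concrete. The two groups `A` and `B` below are made up for illustration (they are not from the thread): each is perfectly correlated on its own, with opposite signs, while the pooled data has zero correlation.

```python
import numpy as np

t = np.arange(-2.0, 3.0)            # offsets -2, -1, 0, 1, 2

# Group A lies on a line with slope +1, centered at (-5, 0).
A = np.column_stack([-5 + t, t])
# Group B lies on a line with slope -1, centered at (5, 0).
B = np.column_stack([5 + t, -t])
X = np.vstack([A, B])               # pooled data set

def corr(P):
    """Pearson correlation between the two columns of P."""
    return np.corrcoef(P[:, 0], P[:, 1])[0, 1]

r_global = corr(X)   # pooled correlation, zero up to rounding
r_A = corr(A)        # within-group correlation, +1 up to rounding
r_B = corr(B)        # within-group correlation, -1 up to rounding
```

A check of the global correlation alone would report nothing of interest here, even though the data consist of two well-separated groups with very different local correlation structure.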
As $k$-means is closely related to PCA, a natural way to whiten/remove the correlations would be running the clustering on the PC scores. – cbeleites unhappy with SX Nov 24 '13 at 14:05
What are these "correlation clustering algorithms"? Can you give an example? – gung - Reinstate Monica Jul 21 '15 at 15:57
ORCLUS for example. – Has QUIT--Anony-Mousse Jul 21 '15 at 20:24
