I am trying to conduct a k-means cluster analysis. Because some variables are highly correlated, I conducted PCA to reduce the number of variables. But some variables show low communalities, and the KMO is lower than 0.5.

I would like to conduct PCA only on the highly correlated variables, and then conduct the cluster analysis using both (a subset of) the PCA scores and the remaining variables. For example, 3 PCA scores (from 10 variables) + 5 individual variables.

Is this possible? If so, can you give me some examples?
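For reference, here is a minimal sketch of the kind of pipeline being described, assuming scikit-learn; the data, the column split (10 correlated variables plus 5 others), and the cluster count are all hypothetical placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))  # placeholder data: 15 variables

# PCA on the 10 "highly related" variables (columns 0-9), keeping 3 scores;
# the remaining 5 variables (columns 10-14) pass through unchanged.
features = ColumnTransformer([
    ("pca", PCA(n_components=3), list(range(10))),
    ("rest", "passthrough", list(range(10, 15))),
])

pipeline = Pipeline([
    ("scale", StandardScaler()),  # put all variables on a comparable scale first
    ("features", features),       # -> 3 PCA scores + 5 original variables = 8 dims
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
])
labels = pipeline.fit_predict(X)  # cluster label for each of the 200 rows
```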

  • It is definitely possible. Just transform your data using PCA and treat the resulting 3 variables as new variables. Yet I do not know any example where such an approach has been done or whether it is a good idea at all. Did you try clustering with (a) the original 10 dimensions and (b) with 3 dimensions after PCA? – Nikolas Rieble Jan 11 '17 at 14:12
  • You apparently are misunderstanding certain things. KMO, for example, can be meaningful in the context of factor analysis, not PCA. Or are you using PCA as a FA method? [KMO](http://stats.stackexchange.com/q/229244/3277) is about partial correlations, not about `only with highly related variables`. – ttnphns Jan 11 '17 at 14:15
  • `conduct PCA only with highly related variables, and conduct cluster analysis using both PCA score and remaining variables` Please describe specifically what you mean here. PCs extracted already have information of the input variables. – ttnphns Jan 11 '17 at 14:20
  • @ttnphns I think the OP is talking about doing PCA/FA on several "highly related" variables, taking leading PCs, taking *in addition* individual variables that were left out of PCA, and doing clustering on all that. In his example, 10 variables go into PCA, 3 scores are taken together with 5 different variables (that did not go into PCA), and clustering is done in the resulting 5+3=8 dimensions. – amoeba Jan 11 '17 at 17:18
  • @amoeba, that would be quite trivial (to invoke a question). But let the OP clarify it. – ttnphns Jan 11 '17 at 17:24

1 Answer


PCA is an affine linear transformation of your data set.

Since k-means does not use correlations, the rotation and translation parts of PCA do not have any effect. So what PCA reduces to is the scaling: it shrinks the effect of the main components and boosts that of the error components. I'm not surprised that it does not work too well.
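To see this concretely, here is a small sketch with synthetic data (numpy/scikit-learn): a pure PCA rotation leaves all pairwise Euclidean distances, and hence the k-means objective, unchanged, while whitening forces every component to unit variance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic data

# Rotation/translation only (all components kept, no whitening):
# Euclidean distances are preserved, so k-means reaches the same objective.
Z_rot = PCA(whiten=False).fit_transform(X)
km_x = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
km_z = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z_rot)
print(np.isclose(km_x.inertia_, km_z.inertia_))  # True

# Whitening rescales every component to unit variance, which shrinks the
# high-variance "main" directions and inflates the low-variance "error"
# directions relative to each other.
Z_white = PCA(whiten=True).fit_transform(X)
print(X.std(axis=0))        # very different spreads per direction
print(Z_white.std(axis=0))  # all ~1 after whitening
```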

Using both the original and the transformed features likely won't help much, because they have different scales: usually either the original or the new features will dominate the result. And even if the total variance of the original and the transformed features were the same, you would essentially be doing a "half PCA": the same as doing PCA, but rather than scaling every component to have unit variance, you scale it to have "original variance + 1".
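A rough numerical check of that "original variance + 1" claim (again a sketch with synthetic data; it concatenates centered originals with their own whitened PCA scores, which is a simplification of the OP's subset setup): the squared distance between two points in the concatenated space decomposes per component with weight lambda_i + 1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

pca = PCA(whiten=True)
Z = pca.fit_transform(Xc)      # whitened scores: unit variance per component
lam = pca.explained_variance_  # original variance of each component

# Squared distance between two points in the concatenated space ...
combined = np.hstack([Xc, Z])
d2 = np.sum((combined[0] - combined[1]) ** 2)

# ... equals the whitened-coordinate distance reweighted by (lambda_i + 1),
# i.e. each component behaves as if scaled to "original variance + 1".
dz = Z[0] - Z[1]
print(d2, np.sum((lam + 1) * dz ** 2))  # the two numbers agree
```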

If you had an approach that performed feature selection (k-means does not; feature selection usually needs training labels for good results), this would not hold. But for k-means, it will not change much.

Don't try to solve problems by stacking as many tools together as possible. Understand what you need to solve, and which tools may help get you there.

Has QUIT--Anony-Mousse
  • As I read the question, the OP wants a small number of components drawn from a PCA on some of the variables, supplemented by some original variables that were not part of the input to the PCA. Does your answer still apply if I am right? Or am I wrong? – mdewey Jan 29 '17 at 17:06
  • Essentially it still applies, only that the weights become more complicated - harder to control, less supported by theory. – Has QUIT--Anony-Mousse Jan 29 '17 at 18:16