
I want to apply PCA first and then run a clustering algorithm such as K-Means on the result.

My understanding is that I run PCA to find the loadings of each original variable, then calculate the scores for each record as linear combinations of its values weighted by each PC's loadings, and finally run clustering on the calculated PCA scores.

Is this correct, or do I need to do anything else before running clustering on the scores?
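The procedure described above can be sketched roughly as follows. This is only a minimal illustration with NumPy; the data are synthetic, and "loadings" is used here in the question's sense (the eigenvectors of the covariance matrix), which is an assumption since some texts scale them by the square roots of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 records, 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Center each variable; PCA works on zero-mean columns.
Xc = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
loadings = eigvecs[:, order]  # one column per principal component

# Scores: each record's centered values combined with each PC's loadings.
scores = Xc @ loadings
```

The `scores` array is what would then be passed to the clustering step.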

amoeba
user122358
  • Sounds right to me. If using R, `prcomp(d)$x` has the rotated data. Don't forget that the data is not scaled by default. – generic_user May 19 '16 at 14:27

1 Answer


PCA decomposes the covariance matrix into rotation and scaling.

If you only use the rotation (keeping all components), you should get the exact same result with k-means, because a rotation preserves Euclidean distances. So you gained nothing.
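A small check of this claim, as a sketch only: the data are synthetic, and `kmeans` below is a hypothetical bare-bones Lloyd's iteration written for this illustration, not a library function. It is started from the same points in both coordinate systems so that the two runs are comparable.

```python
import numpy as np

def kmeans(X, init_centers, n_iter=50):
    """Plain Lloyd's algorithm from given starting centers (illustrative helper)."""
    centers = init_centers.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def pairwise_dist(A):
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 1], [1, 5])])
Xc = X - X.mean(axis=0)

# PCA rotation only: project onto all principal axes, no rescaling.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Pairwise distances are unchanged by the rotation.
same_distances = np.allclose(pairwise_dist(Xc), pairwise_dist(scores))

# Same starting points in both spaces, so the clusterings should coincide.
init_idx = [0, 50, 100]
labels_raw = kmeans(Xc, Xc[init_idx])
labels_pca = kmeans(scores, scores[init_idx])
same_labels = np.array_equal(labels_raw, labels_pca)
```

With a random initialization the labels could still differ between runs, but that would come from the initialization, not from the rotation.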

There are two ways of using the scaling information, which can also be combined:

  1. scale every projected attribute to unit variance
  2. discard attributes with low variance
  3. both.
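The three options can be sketched like this. It is only an illustration on synthetic data; in particular the 1% explained-variance cutoff is an arbitrary assumption, not a recommended value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data with very different spread along three directions.
X = rng.normal(size=(300, 3)) * np.array([5.0, 1.0, 0.05])
Xc = X - X.mean(axis=0)

# PCA via SVD of the centered data.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # rotated data
variances = s**2 / (len(Xc) - 1)        # variance of each projected attribute

# Option 1: scale every projected attribute to unit variance.
whitened = scores / np.sqrt(variances)

# Option 2: discard attributes with low variance
# (the 1% share-of-variance threshold is an illustrative assumption).
keep = variances / variances.sum() > 0.01
reduced = scores[:, keep]

# Option 3: both.
reduced_whitened = whitened[:, keep]
```

Any of `whitened`, `reduced`, or `reduced_whitened` can then be clustered; unlike the pure rotation, each of them changes the distances that k-means sees.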
Has QUIT--Anony-Mousse
  • I standardize the variables before I do PCA, but do I need to standardize the new values on the PC axes again? I think I don't need to, because they are already zero-mean centered in PCA. Am I correct? – user122358 May 22 '16 at 06:40
  • You don't need to scale them prior to PCA. The results will be different though. After the rotation, you will keep the zero mean, but *not* the variances, so scaling *does* have an effect. – Has QUIT--Anony-Mousse May 22 '16 at 07:12
  • +1. I illustrated this in my answer here: http://stats.stackexchange.com/questions/230319. – amoeba Aug 17 '16 at 23:17
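The point made in the comments, that standardized inputs still give PC scores with unequal variances, can be checked with a short sketch. The data below are synthetic and chosen only so that two of the three variables are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two strongly correlated variables plus an independent one.
z = rng.normal(size=500)
X = np.column_stack([z + 0.3 * rng.normal(size=500),
                     z + 0.3 * rng.normal(size=500),
                     rng.normal(size=500)])

# Standardize before PCA: zero mean and unit variance for each variable.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA rotation of the standardized data.
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = Xs @ Vt.T

score_means = scores.mean(axis=0)        # stays at zero after the rotation
score_vars = scores.var(axis=0, ddof=1)  # no longer all equal to one
```

The variances in `score_vars` still sum to the number of variables, but they are spread unevenly across the components, which is why rescaling the scores changes the clustering input.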