I am trying to do a PCA to reduce the number of variables in my data before performing a cluster analysis. Suppose I extract 3 principal components P1, P2 and P3. When I come to do the clustering, which variables should I run my analysis on? It is not clear to me whether I should use all the initial variables (but then how does PCA help?) or the extracted 3 components. A detailed answer with an example would be very helpful.
-
The short answer is that you use the extracted 3 components. – amoeba Sep 20 '16 at 14:28
-
Just a comment on the answered question: beware that most packages standardize your original variables by default before running PCA. That is likely to change the distances between points in your dataset, so the cluster analysis may yield different clusters - not necessarily worse for your purposes, but often very different. – Pere Sep 20 '16 at 14:33
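For reference, R's prcomp centers by default but does not scale; here is a minimal sketch of controlling both explicitly, using the built-in iris data purely for illustration:

# PCA on centered but unscaled variables (prcomp's defaults)
pca_raw <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)
# PCA on standardized variables; distances between points change,
# so a downstream cluster analysis may find different clusters
pca_std <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# compare how the variance splits across components in each case
summary(pca_raw)
summary(pca_std)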
-
@amoeba also, please help me with a little more clarity. Suppose my variables are price, quantity, inventory, total daily orders, days since last transaction and so on. If I form clusters on the basis of these variables, I can make a decision like "goods having price X, quantity Y, inventory Z etc. fall in cluster 1". But how do I do the same with principal components?

Prin1    Prin2
0.72729  -0.44919
0.72378  -0.40766
0.74622  -0.30813
0.68511  -0.28137
0.80647  -0.10525
0.75512   0.36593
0.64098   0.497
0.59269   0.37792
0.76335   0.13454

– Srewashi Lahiri Sep 20 '16 at 14:59
-
To echo others, dimension reduction may not be necessary with 25 variables. You may do better to consider how you standardise and feature-engineer the variables you already have. – conjectures Sep 20 '16 at 14:59
-
Many thanks @conjectures. I will do that for the data in question. But in a scenario where I do go for variable reduction, how do I use the PCs in further analysis? Should I use all the component scores of the extracted PCs for clustering? And if I do so, how do I interpret my clusters in terms of the variables used? – Srewashi Lahiri Sep 20 '16 at 15:02
-
There have been a number of good Q's and A's on the site already. Please just search `PCA cluster analysis`. – ttnphns Sep 20 '16 at 15:44
-
I did, @ttnphns, but I didn't find exactly what I was looking for. – Srewashi Lahiri Sep 20 '16 at 15:58
-
What does "I didn't find exactly what I was looking for" mean? What Q's did you read already & what did you learn from them? What do you still need to know? – gung - Reinstate Monica Sep 20 '16 at 22:01
-
I wanted to know how to use the results of a PCA that precedes a cluster analysis. I posted an answer: "If my original data set A is an n×p matrix and the related PCs P form a p×q matrix (q = 3 as per my initial question of 3 components, and p = the number of original variables), then K = AP will form an n×3 matrix. I hope I can use these 3 transformed variables in clustering" - this is exactly what I wanted to know. Please add any point that would improve my understanding. Thank you – Srewashi Lahiri Sep 20 '16 at 22:11
3 Answers
How many features are in your original data? If it is not too many (say, fewer than a few thousand), many clustering algorithms can work on your original data directly.

By using PCA you lose information. If you do not want to lose too much, you can keep as many PCs as possible (assuming you can afford the computational effort and the curse of dimensionality is not a problem).

If you want to check how much information you lose, you can check my answers to this post to see how to measure how much information (variance) is preserved by PCA.
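For instance, a minimal R sketch of reading off the preserved variance from a prcomp fit (the same iris fit used below):

pca_out <- prcomp(iris[, 1:3])
# proportion of total variance carried by each component
pca_out$sdev^2 / sum(pca_out$sdev^2)
# the same information with cumulative proportions
summary(pca_out)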
To your comment:

If you really want to use PCA, you can run the clustering algorithm on the transformed data. In R, with the toy iris data, the transformed data is pca_out$x:
# PCA of the first three iris variables (prcomp centers the data by default)
pca_out <- prcomp(iris[, 1:3])
# the principal component scores, i.e. the transformed data
pca_out$x
PC1 PC2 PC3
[1,] -2.49088018 -0.320973364 -0.0339745251
[2,] -2.52334286 0.178400622 -0.2329011355
[3,] -2.71114888 0.137820058 -0.0025055723
[4,] -2.55775595 0.315675226 0.0670512306
[5,] -2.53896432 -0.331356903 0.0986154338
[6,] -2.13542015 -0.750523350 0.1367151904
[7,] -2.67669609 0.072944140 0.2311696738
[8,] -2.42912498 -0.162931683 0.0007979233
[9,] -2.70915877 0.572318127 0.0322430634
[10,] -2.44080592 0.123908243 -0.1318158483
[11,] -2.30049402 -0.641538592 -0.0654553841
[12,] -2.41545393 -0.015273540 0.1681603305
[13,] -2.56232620 0.242322950 -0.1666121092
[14,] -3.03215612 0.502494126 0.0604799584
[15,] -2.44677625 -1.179585963 -0.2360617554
[16,] -2.24724960 -1.353446638 0.1997840653
[17,] -2.50197109 -0.829777299 -0.0024222281
[18,] -2.49088018 -0.320973364 -0.0339745251
[19,] -2.00936932 -0.867984466 -0.1284528211
[20,] -2.42654485 -0.524077475 0.1997126274
Note that I am showing only the first 20 data points after the transformation. You can use all 3 transformed features without losing information, or you can use only the first 2 columns; your data then become 2-dimensional, but some information is lost.
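To make the clustering step concrete, here is a minimal sketch; k = 3 and the kmeans settings are assumptions for illustration only. Profiling the original variables by cluster also answers the interpretation question from the comments:

pca_out <- prcomp(iris[, 1:3])
set.seed(42)                  # k-means uses random starting centers
# cluster on the PC scores; k = 3 is an arbitrary choice for this sketch
km <- kmeans(pca_out$x, centers = 3, nstart = 25)
# interpret the clusters in terms of the original variables: mean per cluster
aggregate(iris[, 1:3], by = list(cluster = km$cluster), FUN = mean)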
-
Thanks for the answer. I have around 42,000 observations and 25 variables, so I want to run a PCA on the variables. Let me reframe my question: after PCA, if I extract 'x' principal components, how am I supposed to use the result in my clustering? Should I use the extracted components? Or, if I want to use a subset of the original variables, how do I choose that subset? – Srewashi Lahiri Sep 20 '16 at 13:49
By doing PCA you are retaining all the important information. If your data exhibit clustering, this will generally be revealed by your PCA: by retaining only the components with the highest variance, the clusters will likely be more visible (as they are most spread out).

What you should do is look at the scatterplots in the space defined by your three principal components: the data should be clearly grouped into separate clusters. Once you know the number of clusters, you can apply the K-means algorithm to classify your dataset.
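A minimal R sketch of this workflow, using the built-in iris data purely for illustration; the choice of 3 clusters is an assumption:

pca_out <- prcomp(iris[, 1:4], scale. = TRUE)
# look for separated groups in the pairwise PC scatterplots
pairs(pca_out$x[, 1:3])
# once the number of clusters is chosen from the plots, run K-means
set.seed(42)
km <- kmeans(pca_out$x[, 1:3], centers = 3, nstart = 25)
# colour the PC1-PC2 scatterplot by cluster assignment
plot(pca_out$x[, 1:2], col = km$cluster)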
Useful links:
1. http://www.cs.colostate.edu/~asa/pdfs/pcachap.pdf
2. http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

-
`by retaining only the components with the highest variance, the clusters will be clearly visible (as they are most spread out).` The 1st paragraph, and especially its categorical claim, is misleading. Retaining only a few strong components in PCA does not _guarantee_ finding clusters, because clusters might be well separated in dimensions where they (as the total cloud) are not "most spread out". – ttnphns Sep 20 '16 at 15:50
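A small simulated sketch of this caveat (all numbers here are arbitrary): two clusters separated only along a low-variance direction, so keeping PC1 alone would lose them:

set.seed(1)
n <- 200
# x: large variance, identical across the two clusters
x <- rnorm(2 * n, sd = 10)
# y: small variance, but carries all of the cluster separation
y <- c(rnorm(n, mean = 0), rnorm(n, mean = 4))
pca_out <- prcomp(cbind(x, y))
# PC1 tracks x and shows no clusters; the separation lives in PC2
plot(pca_out$x[, 1:2], col = rep(1:2, each = n))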
-
Seconding @ttnphns, it might be helpful to read this: [Examples of PCA where PCs with low variance are “useful”](http://stats.stackexchange.com/q/101485/7290). – gung - Reinstate Monica Sep 20 '16 at 22:17
-
Most of the time PCA helps in revealing clustering: "PCA constructs a set of uncorrelated directions that are ordered by their variance. In many cases, directions with the most variance are the most relevant to the clustering. Removing features with low variance acts as a filter that provides a more robust clustering." ([link](http://www.cs.colostate.edu/~asa/pdfs/pcachap.pdf)). "High dimensional data are often transformed into lower dimensional data via the PCA where coherent patterns can be detected more clearly." ([link](http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf)) – Angela Sep 21 '16 at 07:53
-
Angy, when addressing a specific commenter you should mention their name in the form @username, otherwise they won't be notified and your reply will be missed. As for the content of your comment: thanks for the links; you might want to expand your answer by adding and discussing them in it. – ttnphns Sep 21 '16 at 08:24
-
`acts as a filter that provides a more robust clustering` This passage is true to an extent. It is, however, about the stability of clusters (as found from sample to sample) and not about the ability to detect them. – ttnphns Sep 21 '16 at 08:34
-
@ttnphns I apologize, I am new here :) What about the sentence from the other paper, "coherent patterns can be detected more clearly"? If directions with the most variance are the most relevant to the clustering, then clusters should be easier to identify. That's the message underlying it, I think. Anyway, I have edited my answer by relaxing the conclusions. I will add the links as well. – Angela Sep 21 '16 at 08:52
Thank you, everyone. I wanted to know whether we use the PCs in the clustering analysis and, if yes, how we use them. I figured out that we don't use the PCs directly but make a transformation of the original variables based on the PCs.

-
This is unclear and possibly wrong. What do you mean by "transformation of the original variables based on the PCs"? – amoeba Sep 20 '16 at 21:51
-
@amoeba If my original data set A is an n×p matrix and the related PCs P form a p×q matrix (q = 3 as per my initial question of 3 components, and p = the number of original variables), then K = AP will form an n×3 matrix. I hope I can use these 3 transformed variables in clustering. Please correct me if I am wrong. – Srewashi Lahiri Sep 20 '16 at 22:04
-
Yes, this is correct. The problem is that when you say "PCs" (as in this answer of yours), it is unclear whether you refer to matrix P or to matrix K. Personally, when I say "PC" I usually refer to matrix K. If you want to be precise, you can call matrix P the "PC eigenvectors" and matrix K the "PC scores". To say that for clustering "we don't use PCs directly" sounds wrong; if you say "we don't use PC eigenvectors directly, but we use PC scores", then it's correct & clear. – amoeba Sep 20 '16 at 22:07
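A minimal R sketch of this terminology, reusing the iris example from the accepted answer:

A <- as.matrix(iris[, 1:3])   # original n x p data
pca_out <- prcomp(A)
P <- pca_out$rotation         # p x q matrix of PC eigenvectors
K <- pca_out$x                # n x q matrix of PC scores
# prcomp centers A before projecting, so K = (A - column means) %*% P
all.equal(K, scale(A, center = TRUE, scale = FALSE) %*% P,
          check.attributes = FALSE)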
-
Perfect! Thanks a ton. The little confusion I was having regarding this terminology is clear now – Srewashi Lahiri Sep 20 '16 at 22:13
-
:-) Consider editing this answer of yours to make it clearer for future readers. – amoeba Sep 20 '16 at 22:19
-
This simply is not an answer. Consider deleting it and editing your question accordingly to reflect your expectations. – ttnphns Sep 21 '16 at 08:37
-
I think this *is* an answer, at least the final sentence, but it would benefit from editing as suggested. – Silverfish Sep 21 '16 at 08:57