
I have 371 samples; each sample has 20,000+ attributes. The data are all numeric. I want to see whether my samples can be clustered, so I first decided to reduce my data with PCA. From PCA, I got 250 principal components that account for around 90% of the variance.

I am now wondering whether I can calculate my 371x371 distance matrix from these 250 principal components and use it for hierarchical clustering. I tried to calculate the distance matrix on all 20,000 attributes, but it took too long. If I can use the 250 principal components instead, I think I can speed up the distance matrix calculation.

So my question is: can principal components be used for calculating a distance matrix for hierarchical clustering? Is this valid mathematically?
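To make the question concrete, here is a minimal sketch in Python with NumPy, assuming Euclidean distance and small synthetic data (60 samples, 500 attributes) in place of the real 371 x 20,000 matrix. The helper names `pca_scores` and `euclidean_distance_matrix` are hypothetical and only for illustration. The sketch compares distances computed on the raw attributes with distances computed on all n - 1 principal component scores and on a truncated set of scores.

```python
import numpy as np

def pca_scores(X, n_components):
    """First n_components principal component scores of X (rows are samples)."""
    Xc = X - X.mean(axis=0)                      # centre each attribute
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def euclidean_distance_matrix(Z):
    """Pairwise Euclidean distances between the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    D = np.sqrt(np.maximum(d2, 0.0))             # clip tiny negatives from round-off
    np.fill_diagonal(D, 0.0)                     # self-distances are exactly zero
    return D

# Synthetic stand-in for the real data (an assumption, not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))

D_full = euclidean_distance_matrix(X)                      # all 500 attributes
D_all_pcs = euclidean_distance_matrix(pca_scores(X, 59))   # all n - 1 components
D_few_pcs = euclidean_distance_matrix(pca_scores(X, 20))   # truncated set

# Keeping every component reproduces the original distances.
print(np.allclose(D_full, D_all_pcs))
# Dropping components can only shrink a distance, never enlarge it.
print(np.all(D_few_pcs <= D_full + 1e-9))
```

Under these assumptions, the truncated distance matrix is an approximation of the full one, and how good it is depends on how much variance the dropped components carried. The resulting square matrix could then be passed to a hierarchical clustering routine, for example via `scipy.spatial.distance.squareform` and `scipy.cluster.hierarchy.linkage` in Python, or `dist()` on the `prcomp` scores followed by `hclust()` in R.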

*Additional info: I have tried using principal components for clustering the iris data in R. The result is quite good, although the accuracy is only 78%. I think this is because the dimension is already really low (only 3), so if I reduce it to 2 dimensions, they cannot cover more than 90% of the variance.

gung - Reinstate Monica
Bharata
  • Generally, you may do that. Why so many attributes? Are they binary attributes? – ttnphns Oct 27 '17 at 10:05
  • Partial least squares is a preferred approach for use in dimension reduction when the number of features is large and *n* is small. – Mike Hunter Oct 27 '17 at 10:08
  • What is your question? It's not clear from what you've written. – MachineEpsilon Oct 27 '17 at 14:41
  • Basically, I just want to hear your opinion on using the PC axes for the distance matrix. Anyway, I have added the question. – Bharata Oct 27 '17 at 23:40
  • There are many attributes because the data are gene expression data. I want to cluster patients with similar gene expression levels. – Bharata Oct 27 '17 at 23:41
  • Reducing the dimensionality from $20,\! 000\rightarrow 250$ may sound impressive, but with $N=371$, only $370$ dimensions are possible (see [here](https://stats.stackexchange.com/q/123318/7290)), so that may not be what you think. I would be pretty concerned about the reliability of the PCA results; I'm not sure you want to hang your hat on them. – gung - Reinstate Monica Oct 27 '17 at 23:55
  • Well, in gene expression data there are many so-called "housekeeping genes". Their expression levels are usually not very important for distinguishing samples, because they are probably very similar across samples. That is my aim: to find genes that can be used to represent the samples. The 250 dimensions cover more than 90% of the variance, so I think that is good enough. Beyond this, adding more components does not increase the variance covered by much. – Bharata Oct 28 '17 at 02:39
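The point raised in the comments, that centred data with N samples span at most N - 1 dimensions, can be checked numerically. The following is a rough sketch in Python with NumPy on pure random noise (30 samples, 2,000 attributes, chosen only as a small stand-in). It counts the non-zero singular values and the number of components needed to reach 90% of the variance. It says nothing about the actual gene expression data.

```python
import numpy as np

# Synthetic noise as a stand-in for the 371 x 20,000 matrix (an assumption).
rng = np.random.default_rng(1)
n_samples, n_attributes = 30, 2000
X = rng.normal(size=(n_samples, n_attributes))

Xc = X - X.mean(axis=0)                          # centre each attribute
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Proportion of total variance carried by each principal component.
explained = singular_values**2 / np.sum(singular_values**2)

# Components with a numerically non-zero singular value.
n_nonzero = int(np.sum(singular_values > 1e-10 * singular_values[0]))

# Smallest number of components whose cumulative variance reaches 90%.
n_for_90 = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)

print(n_nonzero, n_for_90)
```

With structureless noise like this, most of the available N - 1 components are needed to reach 90%, so comparing the number of retained components against N - 1 rather than against the number of attributes gives a fairer picture of how much reduction PCA achieved.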

0 Answers