
I am analyzing segregating plant populations derived from hybridization. The experiment consists of several field plots laid out in an augmented design. In each plot, a segregating population derived from a hybrid plant was sown, so the plants within each plot segregate. I defined several traits corresponding to morphological characteristics of the plant (e.g. leaf colour, flower colour, ...). Because the plants segregate, they fall into different classes for each of these traits (e.g. red or green leaves), and I counted the number of plants in each plot that belong to each class. I can therefore express the data either as the number of plants or as a percentage (e.g. of green or red plants) of the total number of plants in each plot. Since the genetic background of the original hybrids is not known, I would like to run a PCA and a cluster analysis to see which populations cluster together according to these traits. Can PCA be applied to such a data set? Which package can be used to run such an analysis in R?

ttnphns
PietroB
  • Could you provide more specific information about the structure of the dataset and the purpose of the analysis? From your present description it sounds like the data consist of a few counts of occurrences of traits. Instead of being the multivariate tableau required for PCA, that would just be a collection of univariate frequencies. – whuber Jul 16 '14 at 14:58
  • Thanks for your answer! The experiment consists of several field plots laid out in an augmented design. In each plot, a segregating population derived from a hybrid plant was sown, so the plants within each plot segregate. I defined several traits corresponding to morphological characteristics of the plant (e.g. leaf colour, flower colour, ...). Because the plants segregate, they fall into different classes for each of these traits (e.g. red or green leaves), and I counted the number of plants in each plot that belong to each class. – PietroB Jul 16 '14 at 15:12
  • I can therefore express the data either as the number of plants or as a percentage (e.g. of green or red plants) of the total number of plants in each plot. Since the genetic background of the original hybrids is not known, I would like to run a PCA and a cluster analysis to see which populations cluster together according to these traits. – PietroB Jul 16 '14 at 15:16
  • Thank you. Please edit your question to include this information (because not everyone will read through all the comments). – whuber Jul 16 '14 at 15:18
  • So, according to you, I can just run a PCA with princomp on a data set where every column corresponds to one of the traits (e.g. red leaves), expressed as a percentage of the total number of plants in each plot? – PietroB Jul 18 '14 at 09:51
  • I have not made any statements of that nature. My involvement in these comments is purely to help you formulate an answerable question. Please do not read anything more into that. – whuber Jul 18 '14 at 12:54

1 Answer


Can PCA be applied

Yes. Just take care: if your counts are low (say, about 5 plants in a plot), you have to take the statistical uncertainty into account, since 80% as in 4 out of 5 is not the same as 80% as in 4000 out of 5000. See here for a thread that addresses this: PCA on count-based data
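To make the 4 out of 5 versus 4000 out of 5000 comparison concrete, here is a minimal sketch in R. It assumes the count in each class can be treated as binomial, which is an assumption of this example and not something stated in the thread; the helper name `prop_se` is hypothetical.

```r
# Standard error of a proportion p estimated from n plants,
# under a binomial assumption: sqrt(p * (1 - p) / n).
# prop_se is a hypothetical helper written for this illustration.
prop_se <- function(successes, n) {
  p <- successes / n
  sqrt(p * (1 - p) / n)
}

se_small <- prop_se(4, 5)        # 80% observed on 5 plants
se_large <- prop_se(4000, 5000)  # 80% observed on 5000 plants

cat(sprintf("4 / 5:       SE = %.4f\n", se_small))
cat(sprintf("4000 / 5000: SE = %.4f\n", se_large))
```

Both plots report 80%, but the standard error is roughly 0.18 in the first case and roughly 0.006 in the second, so the first percentage is far less reliable as an input to PCA.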

Which R package can I use

See here: princomp
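As a rough sketch of how this could look, assuming one row per plot and one column per trait class expressed as a proportion: the data frame `traits`, its column names, and all values below are invented for illustration. Only one class per two-class trait is included, because the red and green proportions sum to 1 and would be perfectly collinear. This plain PCA does not account for the count uncertainty mentioned above.

```r
# Hypothetical data: rows are plots, columns are proportions of plants
# showing one class of each trait. All numbers are made up.
traits <- data.frame(
  red_leaves    = c(0.25, 0.30, 0.75, 0.80, 0.50, 0.45),
  white_flowers = c(0.10, 0.15, 0.60, 0.55, 0.30, 0.35),
  hairy_stems   = c(0.90, 0.85, 0.20, 0.25, 0.50, 0.60),
  row.names     = paste0("plot", 1:6)
)

# PCA on the correlation matrix, so each trait is standardised first.
# princomp needs more rows (plots) than columns (traits).
pca <- princomp(traits, cor = TRUE)
summary(pca)                 # proportion of variance per component
scores <- pca$scores[, 1:2]  # plot coordinates on the first two components

# Hierarchical clustering of the plots on the standardised traits.
hc <- hclust(dist(scale(traits)), method = "average")
groups <- cutree(hc, k = 3)  # k = 3 is an arbitrary choice for this example
print(groups)
```

`biplot(pca)` and `plot(hc)` give the usual graphical views of the two analyses. `princomp` and `hclust` both ship with the base `stats` package, so nothing needs to be installed.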

Ytsen de Boer