
I am working with a binary data set. There are 'm' bacteria models and 'n' attributes (i.e. genes) in total. The data set records which attributes are present in each model (1 for present, 0 for absent). A view of the data set is shown below. As shown, not every model has every attribute. However, some attributes are constant across all the models, e.g. attributes A1, A2 and A3.

Model   A1  A2  A3  A4  A5  A6  A7
M1      1   1   1   1   1   1   0
M2      1   1   1   1   1   1   1
M3      1   1   1   1   0   0   0
M4      1   1   1   0   0   1   0
M5      1   1   1   1   0   0   0
M6      1   1   1   0   1   0   1
M7      1   1   1   0   0   1   1

I want to cluster the models based on their attributes, so that models with similar attributes are grouped together. I want to know:

  • If I remove attributes A1, A2 and A3 from the data set before the analysis (solely because they are constant in all the models), will it affect my analysis? Or

  • Is it always necessary to do a PCA (or any other statistical validation) before deciding which variables to remove?

I would like to know what the common practice is in such a scenario.
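
A minimal sketch of one way to check the first bullet empirically on the example table. The question does not name a distance measure, so the two used here (Hamming and Jaccard) are assumptions chosen for illustration, and the helper names `hamming`, `jaccard_distance` and `pairwise` are hypothetical. The sketch drops the constant columns and compares the pairwise distance matrices before and after.

```python
# Data copied from the table above (rows M1..M7, columns A1..A7).
data = {
    "M1": [1, 1, 1, 1, 1, 1, 0],
    "M2": [1, 1, 1, 1, 1, 1, 1],
    "M3": [1, 1, 1, 1, 0, 0, 0],
    "M4": [1, 1, 1, 0, 0, 1, 0],
    "M5": [1, 1, 1, 1, 0, 0, 0],
    "M6": [1, 1, 1, 0, 1, 0, 1],
    "M7": [1, 1, 1, 0, 0, 1, 1],
}
rows = list(data.values())

# Zero-based indices of columns that take a single value in every model.
constant_cols = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) == 1]

# Same data with the constant columns removed.
reduced = [[v for j, v in enumerate(r) if j not in constant_cols] for r in rows]


def hamming(a, b):
    """Number of positions where the two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - (shared 1s / positions where at least one vector has a 1)."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1 - both / either if either else 0.0


def pairwise(matrix, dist):
    """Full pairwise distance matrix as a list of lists."""
    return [[dist(a, b) for b in matrix] for a in matrix]


hamming_unchanged = pairwise(rows, hamming) == pairwise(reduced, hamming)
jaccard_unchanged = pairwise(rows, jaccard_distance) == pairwise(reduced, jaccard_distance)

print(constant_cols, hamming_unchanged, jaccard_unchanged)
```

On this table the constant columns are A1, A2 and A3 (indices 0 to 2). A constant column never contributes a mismatch, so the Hamming distances are identical with and without it. The Jaccard distances do change, because shared 1s enter that measure. Under these assumptions, whether removing constant attributes affects a clustering depends on the distance measure the clustering uses.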

SriniShine
  • `PCA (for binary data)` What do you mean? Are you asking about some special form of PCA, or just classic PCA? – ttnphns Jun 15 '17 at 08:43
  • To read a general (maybe not answering your question) topic: https://stats.stackexchange.com/q/39024/3277 – ttnphns Jun 15 '17 at 09:01
  • @ttnphns I meant PCA for binary data. There is a special term for it. I think factor analysis. – SriniShine Jun 15 '17 at 09:16
  • Srini, although the PCA issue is not really the core of your question, it might be good to clarify in your question which "binary PCA" you specifically mean, perhaps with a reference. [This](https://stats.stackexchange.com/q/16331/3277) thread and the further links in it might also point you in the right direction. – ttnphns Jun 15 '17 at 09:33
  • Is your question about attributes which are _constant_ across cases (such as A1, A2, A3 in your example), or is it about non-constant yet identical (or very similar) attributes? It is important to specify this in the question. – ttnphns Jun 15 '17 at 21:51
  • [Should one remove highly correlated variables before doing PCA?](https://stats.stackexchange.com/questions/50537) appears to address the first question about the effect of removing attributes. The second question has been addressed (in passing) in many threads about model building and variable selection, where it has been noted that since PCA of the independent variables tells you absolutely nothing about the response variable(s), performing that PCA certainly shouldn't be your sole means of variable selection! – whuber Jun 15 '17 at 21:53
  • @ttnphns my question is about the attributes which are constant across cases. Thank you. I will edit the question accordingly. – SriniShine Jun 19 '17 at 13:25
  • @ttnphns And this question is not particularly about PCA. I need to know whether a statistical method should always be used to determine which variables to remove. – SriniShine Jun 19 '17 at 13:36
  • Srini, please visit my latest post https://stats.stackexchange.com/q/285892/3277 – ttnphns Jun 20 '17 at 12:13
  • @ttnphns you have put it in a nicer way. Thank you. – SriniShine Jun 21 '17 at 21:08

0 Answers