
I have 5 binary variables with values for 100 observations, giving me a 5x100 matrix:

        y1 y2 .... y100
    x1   0  1 ....    1
    x2   0  0 ....    0
    x3   1  0 ....    0
    x4   1  1 ....    1
    x5   0  0 ....    0

I want to put the observations into fairly homogeneous groups based on their values for x1 to x5. I could try PCA; however, failing any obvious groupings from PCA, what's the best way to group my samples? Should I just try the various clustering methods? And how do I assess the homogeneity of the resulting groups?

Sorry if my questions are a bit vague; perhaps a better question is: what do I need to find out about my data in order to answer the questions above?

Lilo

4 Answers


While I don't have a proof for this, I doubt that PCA is a good method to use on binary data. It is really meant for continuous variables as far as I can tell.

And actually, most clustering methods are meant for continuous data, too!

But given that there can be at most $2^5=32$ different values in your data set, why don't you just take the most frequent value patterns as groups, then assign the remaining observations to the group with the lowest Hamming distance?
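
For illustration only, a rough sketch of that idea in base R; the simulated matrix `X`, the choice of `k`, and the `hamming` helper are all made up here, not something from the question:

    ## Stand-in for the real data: a 5 x 100 binary matrix, variables in rows
    set.seed(1)
    X <- matrix(rbinom(5 * 100, 1, 0.5), nrow = 5,
                dimnames = list(paste0("x", 1:5), paste0("y", 1:100)))

    obs     <- t(X)                                  # observations in rows (100 x 5)
    pattern <- apply(obs, 1, paste, collapse = "")   # each observation as a 5-bit string

    ## Take the k most frequent patterns as "core" groups
    k        <- 4
    core     <- names(sort(table(pattern), decreasing = TRUE))[1:k]
    core_mat <- do.call(rbind, lapply(strsplit(core, ""), as.integer))

    ## Assign every observation to the core pattern with the lowest Hamming distance
    hamming <- function(a, B) colSums(abs(t(B) - a))
    group   <- apply(obs, 1, function(o) which.min(hamming(o, core_mat)))
    table(group)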

Alternatively, you could use Jaccard similarity, for example, and do hierarchical clustering. This approach has a semantic meaning for binary data: Jaccard similarity is well understood. But don't forget: there can only be 32 different records, so with 100 observations you must have plenty of duplicates. With so few attributes, your cluster hierarchy will likely degenerate to levels of "duplicates", "1 difference", "2 differences", "3 differences", "4 differences" and "inverse".
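
A minimal sketch of the Jaccard route, reusing the stand-in matrix `X` from the sketch above; note that `dist(..., method = "binary")` in base R is 1 minus the Jaccard similarity (0-0 matches are ignored):

    d  <- dist(t(X), method = "binary")   # pairwise Jaccard distances between observations
    hc <- hclust(d, method = "average")
    plot(hc)                              # dendrogram: expect only a handful of distinct heights
    groups <- cutree(hc, k = 4)           # cut into, say, 4 clusters
    table(groups)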

Has QUIT--Anony-Mousse

Look into Latent Class Analysis. http://en.wikipedia.org/wiki/Latent_class_model

In summary, a latent class model explains the joint distribution of some set of dichotomous variables by assuming there are sub-groups within your population, and that the observed variables are independent, given sub-group membership. The method effectively allows you to cluster observations based on the set of observed dichotomous variables.

LCA is analogous to, or closely related to, latent profile analysis and finite mixture modelling.
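
As a rough, non-authoritative sketch (the answer above doesn't name any software), one R option is the poLCA package; `dat01` below is a simulated stand-in for the real 100 x 5 table of 0/1 values:

    library(poLCA)

    ## Stand-in data: 100 observations of 5 binary variables x1..x5
    dat01 <- as.data.frame(matrix(rbinom(500, 1, 0.5), ncol = 5,
                                  dimnames = list(NULL, paste0("x", 1:5))))

    dat <- dat01 + 1                        # poLCA wants categories coded 1, 2, ...
    f   <- cbind(x1, x2, x3, x4, x5) ~ 1    # manifest variables only, no covariates
    fit <- poLCA(f, dat, nclass = 3)        # try, e.g., 3 latent classes
    table(fit$predclass)                    # class (i.e. cluster) assignment per observation
    ## Refit with different nclass values and compare fit$bic to choose the number of classes.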

D L Dahly

You might be interested in correspondence analysis, which is meant to be a categorical analogue of PCA. In R, it is implemented in, for example, the ade4 and FactoMineR packages.

Have you tried making a dendrogram of your data? This might give you a way to eyeball the number of clusters.
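
A rough sketch of both suggestions with FactoMineR (multiple correspondence analysis, MCA, is the usual flavour for several categorical variables); the simulated `dat01` is just a placeholder, and MCA wants factors rather than numeric 0/1:

    library(FactoMineR)

    ## Stand-in data: 100 observations of 5 binary variables x1..x5
    dat01 <- as.data.frame(matrix(rbinom(500, 1, 0.5), ncol = 5,
                                  dimnames = list(NULL, paste0("x", 1:5))))
    dat_f <- as.data.frame(lapply(dat01, factor))

    res <- MCA(dat_f, graph = FALSE)
    plot(res, invisible = "var")            # observations on the first two MCA dimensions

    ## Hierarchical clustering on the MCA components, with a dendrogram;
    ## nb.clust = -1 lets FactoMineR cut the tree at its suggested level.
    hc <- HCPC(res, nb.clust = -1, graph = TRUE)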

Stijn

I am by no means well versed in this, but I have been looking into clustering and PCA myself for various reasons. One thing I came across recently is transfer entropy, which could be used to generate a similarity matrix (composed of continuous values $-1 \leqslant r \leqslant 1$) that can then be run through whichever clustering algorithm you choose.

Here's an open-access fMRI article that uses the technique: http://www.nature.com/ncomms/journal/v4/n1/abs/ncomms2388.html
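
However you end up computing it, the clustering step from a precomputed similarity matrix is generic; here is a minimal sketch in R, where the random symmetric `S` is purely a placeholder for a real similarity matrix:

    ## Placeholder: an n x n symmetric similarity matrix with values in [-1, 1]
    n <- 100
    S <- matrix(runif(n * n, -1, 1), n, n)
    S <- (S + t(S)) / 2
    diag(S) <- 1

    d  <- as.dist(1 - S)                  # turn similarity into a distance
    hc <- hclust(d, method = "average")
    groups <- cutree(hc, k = 4)           # pick a number of clusters and inspect them
    table(groups)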

jdv