
I have 5 binary variables with values for 100 observations, giving me a 5x100 matrix:

        y1 y2 .... y100
    x1   0  1 ....    1
    x2   0  0 ....    0
    x3   1  0 ....    0
    x4   1  1 ....    1
    x5   0  0 ....    0

I want to put the observations into fairly homogeneous groups based on their values for x1 to x5. I could try PCA; however, failing any obvious groupings from PCA, what's the best way to group my samples? Should I just try the various clustering methods? And how do I assess the homogeneity of the resulting groups?

Sorry if my questions are a bit vague; perhaps a better question is: what do I need to find out about my data in order to answer the questions above?

Lilo

4 Answers


While I don't have a proof for this, I doubt that PCA is a good method to use on binary data. It is really meant for continuous variables as far as I can tell.

And actually, most clustering methods are meant for continuous data, too!

But given that there can be at most $2^5=32$ different values in your data set, why don't you just take the most frequent value patterns as groups, then assign the remaining observations to the group with the lowest Hamming distance?
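
For illustration only, a rough sketch of that idea in base R; the simulated matrix `X`, the choice of `k`, and the `hamming` helper are all made up here, not something from the question:

    ## Stand-in for the real data: a 5 x 100 binary matrix, variables in rows
    set.seed(1)
    X <- matrix(rbinom(5 * 100, 1, 0.5), nrow = 5,
                dimnames = list(paste0("x", 1:5), paste0("y", 1:100)))

    obs     <- t(X)                                  # observations in rows (100 x 5)
    pattern <- apply(obs, 1, paste, collapse = "")   # each observation as a 5-bit string

    ## Take the k most frequent patterns as "core" groups
    k        <- 4
    core     <- names(sort(table(pattern), decreasing = TRUE))[1:k]
    core_mat <- do.call(rbind, lapply(strsplit(core, ""), as.integer))

    ## Assign every observation to the core pattern with the lowest Hamming distance
    hamming <- function(a, B) colSums(abs(t(B) - a))
    group   <- apply(obs, 1, function(o) which.min(hamming(o, core_mat)))
    table(group)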

Alternatively, you could use Jaccard similarity, for example, and do hierarchical clustering. This approach has a semantic meaning for binary data: Jaccard similarity is well understood. But don't forget: there can only be 32 different records, so with 100 observations you must have plenty of duplicates. With so few attributes, your cluster hierarchy will likely degenerate to levels of "duplicates", "1 difference", "2 differences", "3 differences", "4 differences" and "inverse".
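
A minimal sketch of the Jaccard route, reusing the stand-in matrix `X` from the sketch above; note that `dist(..., method = "binary")` in base R is 1 minus the Jaccard similarity (0-0 matches are ignored):

    d  <- dist(t(X), method = "binary")   # pairwise Jaccard distances between observations
    hc <- hclust(d, method = "average")
    plot(hc)                              # dendrogram: expect only a handful of distinct heights
    groups <- cutree(hc, k = 4)           # cut into, say, 4 clusters
    table(groups)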

Has QUIT--Anony-Mousse

Look into Latent Class Analysis. http://en.wikipedia.org/wiki/Latent_class_model

In summary, a latent class model explains the joint distribution of some set of dichotomous variables by assuming there are sub-groups within your population, and that the observed variables are independent, given sub-group membership. The method effectively allows you to cluster observations based on the set of observed dichotomous variables.

LCA is analogous to, or closely related to, latent profile analysis and finite mixture modelling.
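
As a rough, non-authoritative sketch (the answer above doesn't name any software), one R option is the poLCA package; `dat01` below is a simulated stand-in for the real 100 x 5 table of 0/1 values:

    library(poLCA)

    ## Stand-in data: 100 observations of 5 binary variables x1..x5
    dat01 <- as.data.frame(matrix(rbinom(500, 1, 0.5), ncol = 5,
                                  dimnames = list(NULL, paste0("x", 1:5))))

    dat <- dat01 + 1                        # poLCA wants categories coded 1, 2, ...
    f   <- cbind(x1, x2, x3, x4, x5) ~ 1    # manifest variables only, no covariates
    fit <- poLCA(f, dat, nclass = 3)        # try, e.g., 3 latent classes
    table(fit$predclass)                    # class (i.e. cluster) assignment per observation
    ## Refit with different nclass values and compare fit$bic to choose the number of classes.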

D L Dahly

You might be interested in correspondence analysis, which is meant to be a categorical analogue of PCA. In R, it is implemented in, for example, the ade4 and FactoMineR packages.

Have you tried making a dendrogram of your data? This might give you a way to eyeball the number of clusters.
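
A rough sketch of both suggestions with FactoMineR (multiple correspondence analysis, MCA, is the usual flavour for several categorical variables); the simulated `dat01` is just a placeholder, and MCA wants factors rather than numeric 0/1:

    library(FactoMineR)

    ## Stand-in data: 100 observations of 5 binary variables x1..x5
    dat01 <- as.data.frame(matrix(rbinom(500, 1, 0.5), ncol = 5,
                                  dimnames = list(NULL, paste0("x", 1:5))))
    dat_f <- as.data.frame(lapply(dat01, factor))

    res <- MCA(dat_f, graph = FALSE)
    plot(res, invisible = "var")            # observations on the first two MCA dimensions

    ## Hierarchical clustering on the MCA components, with a dendrogram;
    ## nb.clust = -1 lets FactoMineR cut the tree at its suggested level.
    hc <- HCPC(res, nb.clust = -1, graph = TRUE)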

Stijn

I am by no means well versed in this, but I have been looking into clustering and PCA myself for various reasons. One thing I came across recently is transfer entropy, which could be used to generate a similarity matrix (composed of continuous values $-1 \leqslant r \leqslant 1$) that can then be run through whichever clustering algorithm you choose.

Here's an open-access fMRI article that uses the technique: http://www.nature.com/ncomms/journal/v4/n1/abs/ncomms2388.html
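
However you end up computing it, the clustering step from a precomputed similarity matrix is generic; here is a minimal sketch in R, where the random symmetric `S` is purely a placeholder for a real similarity matrix:

    ## Placeholder: an n x n symmetric similarity matrix with values in [-1, 1]
    n <- 100
    S <- matrix(runif(n * n, -1, 1), n, n)
    S <- (S + t(S)) / 2
    diag(S) <- 1

    d  <- as.dist(1 - S)                  # turn similarity into a distance
    hc <- hclust(d, method = "average")
    groups <- cutree(hc, k = 4)           # pick a number of clusters and inspect them
    table(groups)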

jdv