I have a set of objects, each of which has a list of traits. Data on the traits is binary: an object either has a trait or does not. The number of objects that I have is moderately greater than the number of traits, and much greater if you exclude traits that are held by only a few objects. Virtually all objects have at least three or four traits, and often as many as 15. Currently my sample has about 200 objects in it, but I may find more.

I am looking for sensible candidate algorithms for dividing the objects into groups based on these trait measurements. In the best of all possible worlds, the objects would divide neatly into distinct groups that share traits with other group members but not with outsiders. I already know my data is not so neat. I am looking for an algorithm that assigns objects to groups in such a way that the number of groups is determined by the data, or the data plus a small number of cutoff parameters, and for which a grouping is preferred based on some combination of the following three properties:

  1. the sharing of traits within a group is high;
  2. the sharing of traits between members of different groups is low; and
  3. for the members of each group, the pattern of sharing traits between groups is similar, e.g. if member 1 of group A shares 50 percent of its traits with members of group B but only 1 percent with members of group F, then members 2 through 20 of A will also have high rates of sharing with group B and low rates with group F.
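To make criteria 1 and 2 concrete, here is a rough sketch of how they might be scored for a candidate grouping. It assumes Jaccard similarity as the measure of "sharing", which is only one of several possible choices for binary data; the function names and the toy data are hypothetical, and criterion 3 is left out.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of traits common to two objects, out of all traits either one has."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_sharing(traits, labels):
    """Return (mean within-group, mean between-group) Jaccard similarity."""
    within, between = [], []
    for i, j in combinations(range(len(traits)), 2):
        s = jaccard(traits[i], traits[j])
        (within if labels[i] == labels[j] else between).append(s)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(between)

# Hypothetical toy data: each object is the set of trait names it has.
traits = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y", "z"}, {"x", "y", "w"}]
labels = [0, 0, 1, 1]  # a candidate assignment of objects to groups
within, between = mean_sharing(traits, labels)
```

A grouping that does well on criteria 1 and 2 would have a high `within` value and a low `between` value; how to weight the two against each other is left open.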

Is there a standard or a short list of common algorithms that have been used to address problems with this structure? Alternatively, is there an algorithm that someone would like to champion for this problem? I am hoping to find an algorithm that I can implement with R (or with a spreadsheet).

I do not know the relative importance of criteria 1, 2, and 3 and would ideally like to be able to vary them.

whuber
andrewH
  • Hierarchical clustering seems to be optimal for you. There are plenty of (dis)similarity measures for binary data. Choose the one that suits you best and run hierarchical clustering. As for your point 3, it is feasible only for an extreme cluster. If a cluster lies between other clusters, its points will naturally differ in that respect. – ttnphns Apr 03 '14 at 08:58
  • See the related question and answers here http://stats.stackexchange.com/questions/86318/clustering-a-binary-matrix/86350#86350 – D L Dahly Apr 03 '14 at 11:14
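The hierarchical clustering suggested in the comments can be sketched roughly as follows. This is a minimal illustration, not a recommendation of specific settings: it assumes Jaccard distance and average linkage, and the function names, toy data, and `cutoff` value are all hypothetical. The number of groups is not fixed in advance; merging stops once the closest pair of clusters is farther apart than `cutoff`. In R, a similar analysis could be built from `dist(x, method = "binary")`, `hclust`, and `cutree`.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the fraction of shared traits among all traits either object has."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def average_linkage(traits, cutoff):
    """Merge the closest pair of clusters until no pair is within `cutoff`."""
    clusters = [[i] for i in range(len(traits))]

    def dist(c1, c2):
        # Average linkage: mean pairwise distance between members.
        total = sum(jaccard_distance(traits[i], traits[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters (p < q) with the smallest distance.
        (p, q), d = min(
            (((p, q), dist(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Hypothetical toy data: each object is the set of trait names it has.
traits = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y", "z"}, {"x", "y", "w"}]
groups = average_linkage(traits, cutoff=0.7)
```

Each entry of `groups` is a list of object indices. Raising `cutoff` yields fewer, looser groups; lowering it yields more, tighter ones. The naive search over all cluster pairs is slow in general but should be tolerable for a few hundred objects.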

0 Answers