I have a dataset with about 800 observations, each with about 2000 boolean variables. I would like to cluster the observations. Now, I'm pretty new to all of this so I hope you'll bear with me.

My first thought was to use agglomerative hierarchical clustering. After looking into various linkage methods, I don't think any of them does exactly what I want. At each clustering step I want the new cluster to contain every variable that is "true" in any of the clusters it was formed from.

So let's say we start with the following observations:

   V1 V2 V3 V4 V5 V6
O1 X  X
O2 X  X  X
O3       X  X  X
O4          X  X  X
O5       X        X

The first clusters to be formed should look something like:

    V1 V2 V3 V4 V5 V6
Ca1 X  X  X           (containing O1,O2)
Ca2       X  X  X  X  (containing O3,O4)
Ca3       X        X  (containing O5)

Further along in the process it could look like:

    V1 V2 V3 V4 V5 V6
Cb1 X  X  X           (containing O1,O2)
Cb2       X  X  X  X  (containing O3,O4,O5)

As it moves up the hierarchy it should absorb all the "True" of the previous cluster. The top of the hierarchy is a single cluster with all the variables set to "True".

Does this mean that, each time a new cluster is formed, a new dissimilarity matrix must be calculated? Does this exist? What is this called?
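To make the idea concrete, here is a minimal sketch of the procedure described above, run on the five example observations. It rests on assumptions that are not settled anywhere in this question: the merge criterion is the Jaccard distance between cluster profiles, ties go to the first pair found, and the names `jaccard_distance` and `union_clustering` are made up for illustration. Each cluster's profile is the union of the "true" variables of its members, so in this sketch the distances are recomputed from the updated profiles after every merge.

```python
from itertools import combinations

# The five example observations, each given as the set of its "true" variables.
observations = {
    "O1": {"V1", "V2"},
    "O2": {"V1", "V2", "V3"},
    "O3": {"V3", "V4", "V5"},
    "O4": {"V4", "V5", "V6"},
    "O5": {"V3", "V6"},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def union_clustering(obs):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (members, profile) pairs, where
    profile is the union of the "true" variables of all members.
    """
    clusters = [((name,), frozenset(true_vars)) for name, true_vars in obs.items()]
    history = []
    while len(clusters) > 1:
        # Distances are recomputed from the current profiles on every pass.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard_distance(clusters[p[0]][1], clusters[p[1]][1]),
        )
        members = tuple(sorted(clusters[i][0] + clusters[j][0]))
        profile = clusters[i][1] | clusters[j][1]  # absorb every "true"
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, profile))
        history.append((members, profile))
    return history


history = union_clustering(observations)
for members, profile in history:
    print(members, sorted(profile))
```

On this toy data the merges come out as O1+O2, then O3+O4, then O5 joining O3+O4, and finally everything together, which matches the Ca and Cb stages above. Recomputing every pairwise distance on each pass is fine for five observations but is a naive approach; it is only meant to show the idea.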

Sorry if I'm being unclear, I'll try to answer any questions to my best effort.

Edit: Changed wording in title (dichotomous to binary, removed word)

roger
  • If you are new to this topic, you should first read how agglomerative hierarchical clustering works. It starts with computing a (dis)similarity matrix between the objects. So the first step is to select a (dis)similarity measure for binary data (there are many dozens of such measures). The next step is to select the linkage method. – ttnphns Jan 31 '22 at 11:20
  • Thank you for that, I should've put that in the question. For a (dis)similarity measure I've chosen the "Hamming" method, since it seems to have been developed specifically for binary data. For a linkage method I've selected "average". As I understand it, a (dis)similarity matrix is calculated once at the beginning. After a cluster is formed, the average of its members' (dis)similarity scores is compared to the scores of the other clusters. Am I correct so far? – roger Jan 31 '22 at 15:44
  • roger, sorry, I've deleted my last comment by mistake. Have you read it? I hope you have; it was about "choosing a linkage method" and led to the page https://stats.stackexchange.com/q/195446/3277 – ttnphns Feb 01 '22 at 13:12
  • Hi, I wasn't able to read it until now. Thanks for the link. It seems the solution I initially thought of was not feasible. I'll close this question and post another for my current scenario. Thanks for the help! – roger Feb 02 '22 at 10:33
  • BTW, while not the answer to my initial question, in the end it seemed that the Jaccard method provided sufficiently usable results. – roger Feb 02 '22 at 14:03
  • I have an overview, with formulas, of many binary proximity measures, in case you are interested. It is on my web page, in the collection Various proximities (read the description of KO_proxbin macro there) – ttnphns Feb 02 '22 at 19:34
  • Thanks, I'm having a look right now :) – roger Feb 03 '22 at 21:00
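
For reference, a minimal sketch of the two measures mentioned in these comments, computed once over the five example observations from the question. The helper names `hamming` and `jaccard` are illustrative only, and Hamming is taken here as the proportion of differing variables, which is one common convention and an assumption about what was actually used. The practical difference is that Hamming counts a variable that is "false" in both observations as agreement, while Jaccard ignores such variables.

```python
from itertools import combinations

# Example observations from the question as boolean rows over V1..V6.
T, F = True, False
rows = {
    "O1": (T, T, F, F, F, F),
    "O2": (T, T, T, F, F, F),
    "O3": (F, F, T, T, T, F),
    "O4": (F, F, F, T, T, T),
    "O5": (F, F, T, F, F, T),
}


def hamming(a, b):
    """Proportion of variables on which the two rows disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """1 - (both true) / (at least one true); shared falses are ignored."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0


# Pairwise dissimilarities, computed once up front.
hamming_d = {(p, q): hamming(rows[p], rows[q]) for p, q in combinations(rows, 2)}
jaccard_d = {(p, q): jaccard(rows[p], rows[q]) for p, q in combinations(rows, 2)}

for pair in hamming_d:
    print(pair, round(hamming_d[pair], 3), round(jaccard_d[pair], 3))
```

Either dictionary could then be turned into a dissimilarity matrix and passed to a hierarchical clustering routine with the linkage of your choice.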
