I have a dataset with about 800 observations, each with about 2000 boolean variables. I would like to cluster the observations. I am using scipy in Python.
For a (dis)similarity measure I've chosen the "Hamming" method, since it seems specifically developed for binary/boolean data. I've run into an issue with this method.
For example:
     V1 V2 V3 V4 V5 V6
R1 =  1  1  1  1  1  1
R2 =  1  1  1  1  1  0
R3 =  0  0  0  1  1  1
R4 =  0  0  0  1  1  0
Comparing R1 with R2 gives a Hamming score of 1/6 = 0.167.
Comparing R3 with R4 gives a Hamming score of 1/6 = 0.167.
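To make the issue reproducible, here is a minimal sketch of how these two scores come out of SciPy. It assumes the four example rows are stored as a boolean NumPy array; the variable names are only for illustration.

```python
import numpy as np
from scipy.spatial.distance import hamming

# The four example observations (R1..R4) over variables V1..V6.
rows = np.array([
    [1, 1, 1, 1, 1, 1],  # R1
    [1, 1, 1, 1, 1, 0],  # R2
    [0, 0, 0, 1, 1, 1],  # R3
    [0, 0, 0, 1, 1, 0],  # R4
], dtype=bool)
r1, r2, r3, r4 = rows

# scipy's hamming() returns the fraction of positions that differ,
# so the denominator is always the total number of variables (6 here).
d12 = hamming(r1, r2)
d34 = hamming(r3, r4)
print(round(d12, 3), round(d34, 3))  # 0.167 0.167
```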
For my purposes, however, the difference between R3 and R4 is more significant than the difference between R1 and R2. A 0 in my data stands for the absence of a variable (V). The result that I am looking for is:
Comparing R1 with R2 gives a score of 1/6 = 0.167.
Comparing R3 with R4 gives a score of 1/3 = 0.333.
So I guess I am looking to divide the number of mismatches by the count of variables (V) for which at least one of the two compared observations has a "1", rather than by the total number of variables. Should I pick a different measure, or should I try to create a variation of the Hamming method?
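A rough sketch of the variation I have in mind, in case it helps pin down what I am asking for. The function name `present_hamming` is made up for this example, and returning 0.0 when neither row has any "1" is an arbitrary assumption on my part. Passing a Python callable as the metric to `pdist` is supported by SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist

def present_hamming(u, v):
    """Mismatches divided by the number of variables present in u or v.

    Hypothetical helper, not part of SciPy.
    """
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    mismatches = np.count_nonzero(u != v)
    present = np.count_nonzero(u | v)  # variables with at least one "1"
    # Assumption: two all-zero rows are treated as identical.
    return mismatches / present if present else 0.0

rows = np.array([
    [1, 1, 1, 1, 1, 1],  # R1
    [1, 1, 1, 1, 1, 0],  # R2
    [0, 0, 0, 1, 1, 1],  # R3
    [0, 0, 0, 1, 1, 0],  # R4
], dtype=bool)

d12 = present_hamming(rows[0], rows[1])  # 1 mismatch / 6 present
d34 = present_hamming(rows[2], rows[3])  # 1 mismatch / 3 present

# Condensed pairwise distances in the order
# (R1,R2), (R1,R3), (R1,R4), (R2,R3), (R2,R4), (R3,R4).
condensed = pdist(rows, metric=present_hamming)
```

The condensed vector from `pdist` is the form that `scipy.cluster.hierarchy.linkage` accepts, so this could feed directly into hierarchical clustering. A Python callable is evaluated once per pair, which may be slow for 800 observations.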