0

I have a dataset with about 800 observations, each with about 2000 boolean variables. I would like to cluster the observations. I am using scipy in Python.

For a (dis)similarity measure I've chosen the "Hamming" method, since it seems specifically developed for binary/boolean data. I've run into an issue with this method.

For example:

     V1 V2 V3 V4 V5 V6
R1 = 1  1  1  1  1  1
R2 = 1  1  1  1  1  0
R3 = 0  0  0  1  1  1
R4 = 0  0  0  1  1  0

Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/6=0.167

For my purposes however, the distance between R3 with R4 is more significant than the difference of R1 with R2. The 0 in my data stands for an absence of a variable (V). The result that I am looking for is:

Comparing R1 with R2, gives a Hamming score of 1/6=0.167
Comparing R3 with R4, gives a Hamming score of 1/3=0.333

So I guess I am looking to divide the Hamming over the the count of variables (V) where there is at least one "1" for the pair that is compared. Should I pick a different measure or should I try to a create a variation of the Hamming method?

Tim
  • 108,699
  • 20
  • 212
  • 390
roger
  • 3
  • 2

1 Answers1

1

As you noticed in the comment, what you are describing seems to be the inverse of Jaccard similarity. Jaccard similarity is defined as

$$ J(A, B) = \frac{| A \cap B |}{| A \cup B |} $$

In your case, union for R3 and R4 is V4, V5, V6, while intersection is V4 and V5, so it's 2/3. It is a similarity metric, to make it a distance metric take $1 - J(A, B)$. It's $1 - 5/6 = 1/6$ and $1 - 2/3 = 1/3$ as you wanted.

Tim
  • 108,699
  • 20
  • 212
  • 390
  • I accidentally removed my comment, sorry about that. It indeed seems the Jaccard similarity does what I want. Thanks for the help! – roger Feb 02 '22 at 14:01