How would I determine a correlation between sets of integers such as the following:

set A: 1, 2, 3, 4
set B: 2, 3, 4
set C: 4, 5
set D: 2, 5

I want to have a procedure that will let me compute things like "if a set contains 2, there's a 75% chance it also contains 4", and to do that for all pairs of numbers that exist in the sets to get a correlation matrix.
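As a minimal sketch of that procedure, the conditional co-occurrence of every ordered pair of elements can be estimated by counting. The function name `conditional_cooccurrence` and the dictionary layout are illustrative assumptions, not an established API. For the four sets above, the value for 2 and 4 works out to 2/3; the 75% in the question is taken as an illustrative figure.

```python
from itertools import permutations

# The four example sets from the question.
sets = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {2, 5},
}

def conditional_cooccurrence(collections):
    """Return {(x, y): fraction of sets containing x that also contain y}.

    Hypothetical helper for illustration; covers all ordered pairs x != y.
    """
    groups = list(collections)
    elements = sorted(set().union(*groups))
    # Number of sets that contain each element (always >= 1 here).
    count = {x: sum(x in s for s in groups) for x in elements}
    cond = {}
    for x, y in permutations(elements, 2):
        both = sum(x in s and y in s for s in groups)
        cond[(x, y)] = both / count[x]
    return cond

cond = conditional_cooccurrence(sets.values())
print(round(cond[(2, 4)], 3))  # 0.667
```

Note that the resulting matrix is not symmetric in general: the estimate of "contains 4 given 2" divides by the number of sets containing 2, while "contains 2 given 4" divides by the number containing 4.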

What I really want to know is whether, given a large number of these sets, there are groups of sets that are highly similar to each other and somewhat dissimilar to the sets in other groups. The labels of the sets (A, B, C, D) are arbitrary and unimportant; only the contents of the sets are significant.
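One possible way to make "highly similar" concrete is the Jaccard index, the size of the intersection of two sets divided by the size of their union. The sketch below computes it for every pair of the example sets; the names `jaccard` and `similarity` are illustrative assumptions.

```python
from itertools import combinations

# The four example sets from the question.
sets = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {2, 5},
}

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Pairwise similarity for every unordered pair of set labels.
similarity = {
    (p, q): jaccard(sets[p], sets[q])
    for p, q in combinations(sorted(sets), 2)
}

for pair, value in similarity.items():
    print(pair, round(value, 3))
```

The quantity `1 - similarity` behaves as a distance, so it could in principle be passed to any clustering method that accepts a precomputed dissimilarity matrix, such as hierarchical clustering.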

I could write some code to compute these correlations pretty easily, but I am wondering whether there are more sophisticated techniques for extracting this kind of information. I don't know what this sort of correlation is called, so it's difficult to google. Any suggestions?

Erik
  • mutual information? – Vladislavs Dovgalecs Jul 16 '15 at 23:04
  • Well this isn't a complete answer, but your question about grouping them sounds like you want clustering. You might find the answers to this question helpful: http://stats.stackexchange.com/questions/86318/clustering-a-binary-matrix – Davis Yoshida Jul 17 '15 at 00:18
  • Similarity between sets can be defined as $|A \cap B| / |A \cup B|$ (the Jaccard index), and then you could try to find a clustering so that this value is close to one for sets in the same cluster and close to zero for sets in different clusters. It should be possible to use some existing clustering method along with this idea. – dsaxton Jul 17 '15 at 00:57
  • Mutual information seems promising, but how do you calculate a joint probability from the binary presence/absence of elements in sets? – Erik Jul 17 '15 at 15:43
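One possible reading of the mutual information suggestion, offered as a sketch rather than a definitive answer: treat each set as one observation and the presence of each element as a binary variable. The joint probability of two elements is then estimated from the 2×2 table of how often each presence/absence combination occurs across the sets. The function name `mutual_information` is an illustrative assumption, and this plug-in estimate will be noisy when there are only a few sets.

```python
from math import log2

# The four example sets from the question.
sets = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {2, 5},
}

def mutual_information(collections, x, y):
    """Plug-in estimate (in bits) of the mutual information between the
    presence indicators of elements x and y, one observation per set."""
    groups = list(collections)
    n = len(groups)
    mi = 0.0
    for has_x in (True, False):
        for has_y in (True, False):
            # Empirical joint and marginal probabilities for this cell.
            p_xy = sum((x in s) == has_x and (y in s) == has_y for s in groups) / n
            p_x = sum((x in s) == has_x for s in groups) / n
            p_y = sum((y in s) == has_y for s in groups) / n
            if p_xy > 0:  # empty cells contribute nothing
                mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

print(round(mutual_information(sets.values(), 3, 1), 4))
```

Unlike the conditional probability, this measure is symmetric in the two elements, and it is zero when the presence of one element tells you nothing about the presence of the other.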
