If I have two binary variables, I can determine the similarity of these variables quite easily with different similarity measures, e.g. with the Jaccard similarity measure:
$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$
Example in R:
# Example data
N <- 1000
x1 <- rbinom(N, 1, 0.5)
x2 <- rbinom(N, 1, 0.5)
# Jaccard similarity measure
a <- sum(x1 == 1 & x2 == 1)
b <- sum(x1 == 1 & x2 == 0)
c <- sum(x1 == 0 & x2 == 1)
jacc <- a / (a + b + c)
jacc
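The counts in the formula can also be read off a 2 x 2 contingency table, which makes the link to $M_{11}$, $M_{10}$ and $M_{01}$ explicit. A minimal sketch on small hand-made vectors (the names `y1`, `y2`, `tab` and `jacc_tab` are only for illustration and are not part of the example above):

```r
# Toy vectors chosen by hand so the result is easy to check
y1 <- c(1, 1, 0, 0, 1)
y2 <- c(1, 0, 1, 0, 1)

# Fix the factor levels so the table is always 2 x 2, even if a value never occurs
tab <- table(factor(y1, levels = 0:1), factor(y2, levels = 0:1))

M11 <- tab["1", "1"]  # both variables are 1
M10 <- tab["1", "0"]  # y1 is 1, y2 is 0
M01 <- tab["0", "1"]  # y1 is 0, y2 is 1

jacc_tab <- unname(M11 / (M01 + M10 + M11))
jacc_tab
```

Here two positions are 1 in both vectors and two positions are 1 in exactly one of them, so the value is 2 / 4.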
However, I have a group of 1,000 binary variables and want to measure the similarity of the group as a whole.
Question: What is the best way to determine the similarity of more than 2 binary variables?
One idea is to measure the similarity for each pairwise combination and then summarize the values, e.g. with the mean or, as in the code below, the median. You can find an example of this procedure below:
# Example data
N <- 1000 # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)
list_x <- split(mat_x, rep(1:ncol(mat_x), each = nrow(mat_x)))
# Function for calculation of Jaccard similarity
fun_jacc <- function(v1, v2) {
  a <- sum(v1 == 1 & v2 == 1) # 1 in both vectors
  b <- sum(v1 == 1 & v2 == 0) # 1 only in v1
  c <- sum(v1 == 0 & v2 == 1) # 1 only in v2
  jacc <- a / (a + b + c)
  return(jacc)
}
# Apply function to all combinations (takes approx. 1 min to calculate)
mat_jacc <- sapply(list_x, function(x) sapply(list_x, function(y) fun_jacc(x,y)))
mat_jacc[upper.tri(mat_jacc)] <- NA
diag(mat_jacc) <- NA
vec_jacc <- as.vector(mat_jacc)
vec_jacc <- vec_jacc[!is.na(vec_jacc)]
median(vec_jacc)
This is very inefficient, though, and I am also not sure whether it is theoretically the best way to measure the similarity of such a group of variables.
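On the efficiency point alone: for 0/1 columns, the number of shared ones for every pair is given by a single `crossprod()` call, and the size of each union is the sum of the two column totals minus that intersection, so the nested `sapply` can be avoided. A minimal sketch, assuming 0/1 columns without missing values (the function name `jacc_all_pairs` and the smaller data set `m_small` are made up for illustration). It does not settle whether a summary of pairwise values is the right group-level measure.

```r
# All pairwise Jaccard similarities between the columns of a 0/1 matrix
jacc_all_pairs <- function(m) {
  inter <- crossprod(m)                  # a: positions where both columns are 1
  ones  <- colSums(m)                    # number of ones per column
  uni   <- outer(ones, ones, "+") - inter  # a + b + c: positions where at least one is 1
  inter / uni                            # NaN if both columns contain no ones at all
}

# Smaller example data than above so it runs instantly
set.seed(1)
m_small <- matrix(rbinom(200 * 20, 1, 0.5), ncol = 20)

jm <- jacc_all_pairs(m_small)
pair_vals <- jm[lower.tri(jm)]  # each pair once, diagonal excluded
median(pair_vals)
```

Taking `jm[lower.tri(jm)]` replaces the steps that set the upper triangle and the diagonal to `NA` and then filter them out.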
UPDATE: Following user43849's suggestion, the dissimilarity could be calculated with Sorensen's multiple-site dissimilarity:
# Example data
N <- 1000 # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)
# Multiple-site dissimilarity according to Sorensen
library("betapart")
beta.multi(t(mat_x), index.family = "sor")$beta.SOR # Vectors are not similar --> almost 1