
If I have two binary variables, I can determine the similarity of these variables quite easily with different similarity measures, e.g. with the Jaccard similarity measure:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

where $M_{11}$ counts the observations in which both variables are 1, and $M_{10}$ and $M_{01}$ count those in which only the first or only the second variable is 1.

Example in R:

# Example data
N <- 1000
x1 <- rbinom(N, 1, 0.5)
x2 <- rbinom(N, 1, 0.5)

# Jaccard similarity measure
a <- sum(x1 == 1 & x2 == 1) # M11: both variables are 1
b <- sum(x1 == 1 & x2 == 0) # M10: only x1 is 1
c <- sum(x1 == 0 & x2 == 1) # M01: only x2 is 1

jacc <- a / (a + b + c)
jacc

However, I have a group of 1,000 binary variables and want to determine the similarity of the whole group.

Question: What is the best way to determine the similarity of more than 2 binary variables?

One idea is to measure the similarity for each pairwise combination and then aggregate, e.g. take the average or the median of all pairwise values. You can find an example of this procedure below:

# Example data
N <- 1000     # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)
list_x <- split(mat_x, rep(1:ncol(mat_x), each = nrow(mat_x))) # one list element per column

# Function for calculation of the Jaccard similarity
fun_jacc <- function(v1, v2) {

  a <- sum(v1 == 1 & v2 == 1)
  b <- sum(v1 == 1 & v2 == 0)
  c <- sum(v1 == 0 & v2 == 1)

  jacc <- a / (a + b + c)
  return(jacc)
}

# Apply function to all combinations (takes approx. 1 min to calculate)
mat_jacc <- sapply(list_x, function(x) sapply(list_x, function(y) fun_jacc(x, y)))
mat_jacc[upper.tri(mat_jacc)] <- NA # keep each pair only once
diag(mat_jacc) <- NA                # drop self-similarities
vec_jacc <- as.vector(mat_jacc)
vec_jacc <- vec_jacc[!is.na(vec_jacc)]
median(vec_jacc)

This is very inefficient, though, and I am also not sure whether this is theoretically the best way to measure the similarity of such a group of variables.
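As an aside, the nested `sapply()` loop can be avoided entirely: the counts of shared 1s for all pairs are given by a single matrix product, so the full Jaccard matrix can be computed in a vectorized way. This is only a sketch of that reformulation (it reproduces `mat_jacc` above, just much faster):

# Vectorized alternative to the pairwise loop above
A <- crossprod(mat_x)            # shared 1s ("a") for every pair of columns
cs <- colSums(mat_x)             # number of 1s in each column
denom <- outer(cs, cs, "+") - A  # a + b + c for every pair
mat_jacc_fast <- A / denom
median(mat_jacc_fast[lower.tri(mat_jacc_fast)]) # same value as median(vec_jacc)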

UPDATE: Following user43849's suggestion, the dissimilarity can be calculated with Sorensen's multiple-site dissimilarity:

# Example data
N <- 1000     # Observations
N_vec <- 1000 # Number of vectors
x <- rbinom(N * N_vec, 1, 0.5)
mat_x <- matrix(x, ncol = N_vec)

# Multiple-site dissimilarity according to Sorensen (rows of the input = sites/vectors)
library("betapart")
beta.multi(t(mat_x), index.family = "sor")$beta.SOR # close to 1 --> the vectors are not similar
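(Side note: for the Sorensen family, `beta.multi()` also returns `beta.SIM` and `beta.SNE`, the turnover and nestedness components into which `beta.SOR` decomposes.)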
Joachim Schork
  • What do you want to use it for? There are a lot of potential ways to do this, which all mean different things.... – Danica May 21 '17 at 14:35
  • I want to compare 3 different methods based on these similarity measures. I will perform a Monte Carlo simulation, in which I will apply these 3 methods 1000 times. The result of each method and each simulation run is a binary vector. After the simulation I will have 3 x 1000 binary vectors and the similarity should be measured in each of these 3 groups. The method, which is able to produce the most similar binary vectors, is considered to be the best. – Joachim Schork May 21 '17 at 14:41
  • From a conceptual point of view, how do you measure the similarity between more than two things? There are a few ways to extend the similarity between two things, like comparing all pairs, looking at the most different ones, averaging the pairwise distances,... – Hooman May 23 '17 at 10:39
  • Thank you for your feedback Hooman! You wrote `"averaging the pairwise distances"`. So would you say that my idea, which I have shown in the example in my post, is a sufficient way to measure the similarity of more than 2 vectors? – Joachim Schork May 23 '17 at 12:58
  • I like the Hamming distance. An advantage there is that you are framing your problem in the language of error-correcting codes. – EngrStudent May 24 '17 at 13:13
  • Thanks for your input @EngrStudent! Is it possible to use the Hamming distance for more than 2 variables? [Wikipedia says](https://en.wikipedia.org/wiki/Hamming_distance) that it measures the distance between **two** strings only. – Joachim Schork May 24 '17 at 13:16
  • Minor point. Do you know a priori that the two variables can't both be simultaneously == 0? Or is that just uninteresting in this case. – meh May 27 '17 at 15:04
  • @aginensky both variables can simultaneously be equal to 0. This scenario is not considered explicitly in the formula of the Jaccard similarity measure, and for that reason I have not counted these cases. – Joachim Schork May 29 '17 at 07:15
  • `not sure if this is theoretically the best way to measure the similarity of such a group of variables` Everything depends on what you want. You see, a similarity or a correlation between 3+ variables (vectors) is inherently multidimensional. I.e. to incorporate all the information about the ties, it needs the matrix of pairwise coefficients. What information are you ready to sacrifice to draw just one value out of it? – ttnphns May 29 '17 at 15:36
  • @ttnphns, do you have any other ways in mind in which a big group of vectors could be compared regarding their (dis)similarity? I am aware that a reduction to just one value cannot reflect all structures in my data. However, I don't know another practical way to determine the similarity. – Joachim Schork May 29 '17 at 15:59
  • Are any of the binary variables in your dataset related? (e.g. dummy coding of a categorical variable) – g3o2 May 30 '17 at 08:09
  • Yes, they should be very related. I applied the same method repeatedly to slightly different data sets and these vectors are the result of this procedure. Some of them might even be exactly the same. – Joachim Schork May 30 '17 at 09:09
  • Similarity calculations are likely to suffer from the __curse of dimensionality__. In your case, the risk of irrelevant features polluting the similarity calculations is high. So you should really perform some kind of __feature selection, feature grouping or feature weighting__. This is particularly important with Jaccard index because the common absence of attributes is disregarded in the distance calculation. – g3o2 May 30 '17 at 22:54
  • @g3o2 I will perform a Monte Carlo simulation and each simulation run results in a binary vector. The binary vectors should be as similar as possible. I think I should not exclude some of these vectors, since it would bias the results of the simulation. – Joachim Schork May 31 '17 at 07:23

3 Answers


This answer will draw heavily on the ecological literature, where Jaccard and other (dis)similarity measures are commonly used to quantify the compositional (dis)similarity between species assemblages at different sites. The single best reference is Baselga (2013), *Multiple site dissimilarity quantifies compositional heterogeneity among several sites, while average pairwise dissimilarity may be misleading*, which is freely available here.

Basically, there are several approaches to quantifying higher-order dissimilarities (higher-order than pairwise). One is to average the pairwise dissimilarities for all pairs in the sample. This metric generally performs poorly for a variety of reasons, detailed in Baselga (2013). Another possibility is to find the average distance from an observation to the multivariate centroid.
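For illustration, the centroid approach can be sketched with `betadisper()` from the vegan package (the Jaccard-based distance and the package choice are my assumptions here, not something prescribed by the papers):

# Average distance to the multivariate centroid (sketch; uses mat_x from the question)
library("vegan")
d <- dist(t(mat_x), method = "binary") # "binary" is the Jaccard distance in base R
mod <- betadisper(d, group = rep(1, ncol(mat_x)), type = "centroid")
mean(mod$distances)                    # average distance of the vectors to their centroid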

There is an explicit generalization of the Sorensen index to more than two observations. Recall that the Sorensen index is $\frac{2ab}{a+b}$, where $a$ is the number of species (ones, in your case) in sample A, $b$ is the number of species in sample B, and $ab$ is the number of species shared by samples A and B (i.e. the dot product). The three-site generalization, formulated by Diserud and Odegaard (2007) and discussed by Chao et al. (2012), is $\frac{3}{2}\frac{ab+ac+bc-abc}{a+b+c}$. Consult Diserud and Odegaard (2007) for the motivation behind this metric as well as for extensions to $N>3$. The references in Baselga (2013) will also point you to a multi-site generalization of the Simpson index, as well as to R packages that compute the multi-site Sorensen and Simpson metrics.
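To make the three-site formula concrete, here is a direct implementation of it in R (a sketch following the formula above; v1, v2 and v3 are binary 0/1 vectors):

# Three-site Sorensen generalization of Diserud and Odegaard (2007)
sor3 <- function(v1, v2, v3) {
  a   <- sum(v1); b  <- sum(v2); c  <- sum(v3)                # 1s in each vector
  ab  <- sum(v1 & v2); ac <- sum(v1 & v3); bc <- sum(v2 & v3) # 1s shared pairwise
  abc <- sum(v1 & v2 & v3)                                    # 1s shared by all three
  (3 / 2) * (ab + ac + bc - abc) / (a + b + c)
}
sor3(rbinom(100, 1, 0.5), rbinom(100, 1, 0.5), rbinom(100, 1, 0.5))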

Some researchers have also found it useful to examine the average number of species shared by $i$ sites, where $i$ ranges from $2$ to $N$. This reveals some interesting scaling properties and unites a variety of concepts for different values of $i$. The key reference here is Hui and McGeoch (2014), available for free here. This paper also has an associated R package called 'zetadiv'.

Jacob Socolar
  • Thanks a lot for your answer @user43849! I just worked through the papers and they made the topic much clearer to me. Above in my post, I have updated my R example with an implementation of Sorensen's multiple-site dissimilarity suggested in Baselga (2013). – Joachim Schork May 29 '17 at 13:23
  • As far as I knew, the Sorensen index is a synonym for the Dice index, having the formula 2a/(2a+b+c) – ttnphns May 29 '17 at 15:28
  • Not sure what a, b, and c represent in your formula, but if a is the number of shared ones (or species), b is the number unique to one vector, and c is the number unique to the other, then these formulas are exactly equivalent. You're right that the Sorensen index is also called the Sorensen-Dice index. The formula given in the post is correct, and consistent with the definitions used in the relevant papers. See also https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient – Jacob Socolar May 29 '17 at 15:32
  • I like your direction @user43849. Some other pairwise relations are also extendable to a many-to-one relation. Such are mutual information and correlation. Mutual information is also easily extended to a many-to-many relation. – DaL May 31 '17 at 14:29
  • An important new paper on this same topic: http://onlinelibrary.wiley.com/doi/10.1111/ecog.01942/abstract – Jacob Socolar Jun 08 '17 at 20:05

A possible direction is to represent the problem as a graph. The variables will be the nodes, and the edges will connect pairs with a sufficiently strong pairwise correlation.

You can define many correlation measures based on the graph, and you should find the one that suits you best. For most common graph-based measures, the denser the graph, the more correlated the set of variables is.

Possible measures might be (a small R sketch follows the list):

  • The number (or ratio) of nodes (variables) with edges (i.e. with a strong correlation to some other variable). It is easy to compute, and its complement gives you the number of independent variables. A variation on this metric is choosing only nodes with at least a few edges.
  • The number of edges. This metric gives more weight to the number of correlations but might give a high score to a graph that still contains many unconnected variables.
  • The number (or ratio) of connected components. This measure is a bit more complex, but it captures our common definition of similarity better. If a variable is connected to a group of variables, they tend to behave as a single unit, regardless of the number of variables. You can use dbscan in order to get the graph structure, but other graph algorithms might fit your needs as well.
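A minimal sketch of this idea in R (assuming the igraph package; the Jaccard similarity and the 0.5 threshold are illustrative choices, not part of the answer itself):

# Build a graph with one node per variable and an edge per "strong enough" similarity
library("igraph")
sim <- crossprod(mat_x) / (outer(colSums(mat_x), colSums(mat_x), "+") - crossprod(mat_x))
adj <- (sim > 0.5) * 1 # adjacency matrix: 1 where the pairwise Jaccard similarity exceeds 0.5
diag(adj) <- 0         # no self-loops
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
sum(degree(g) > 0)     # nodes with at least one edge
ecount(g)              # number of edges
components(g)$no       # number of connected components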
DaL
  • Thank you for your answer DaL! Could you provide any examples, literature or R packages that are able to create such a graph? I'm not sure if I understand correctly what this graph would look like. – Joachim Schork May 30 '17 at 07:09
  • You have a dbscan implementation as part of sklearn. Given the implementation you'll choose, build the graph in the format that fits it. As for the content of the graph, first create a node from each variable. Then compute your similarity measure for each pair of variables and create an edge if the relation is strong enough. For the dbscan documentation see here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html – DaL May 30 '17 at 15:10
  • Thank you for the additional explanation! I will definitely have a look into the package. – Joachim Schork May 31 '17 at 07:09

Similarity is always between two items. It can be extended to more (say three) items by first creating a single representation for two items and then finding its similarity with the third item. So, when you have 1000 items, you can ask how similar to, or how far from, a representation of these 1000 vectors any given vector is. This representation can be the mean of these 1000 vectors, or it can be anything else, based on how you define it.
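For instance (a sketch of this idea, with the mean vector as the representation; mat_x is the matrix of binary vectors from the question):

center <- rowMeans(mat_x) # mean of all vectors as the single representation
d_center <- apply(mat_x, 2, function(v) mean(abs(v - center))) # mean absolute distance to it
mean(d_center) # small values indicate that the vectors lie close to their common representation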

Shishir Pandey
  • There are much more principled ways than this to examine dissimilarities between >2 items. – Jacob Socolar May 27 '17 at 14:57
  • Thanks for your input @Shishir Pandey. Do you know of some literature about the performance of this method? Because, like user43849 says, I suspect that there might be better ways to solve my problem. – Joachim Schork May 29 '17 at 07:19