Calculating Jaccard or other association coefficient for binary data using matrix multiplication

Question

I want to know if is there any possible way to calculate Jaccard coefficient using matrix multiplication.

I used this code

    jaccard_sim <- function(x) {
    # initialize similarity matrix
    m <- matrix(NA, nrow=ncol(x),ncol=ncol(x),dimnames=list(colnames(x),colnames(x)))
    jaccard <- as.data.frame(m)

    for(i in 1:ncol(x)) {
     for(j in i:ncol(x)) {
        jaccard[i,j]= length(which(x[,i] & x[,j])) / length(which(x[,i] | x[,j]))
        jaccard[j,i]=jaccard[i,j]        
       }
     }

This is quite ok to implement in R. I have done the dice similarity one, but got stuck with Tanimoto/Jaccard. Anybody can help?

Looks like @ttnphns has this covered, but since you're using R, I thought I'd also point out that a number of similarity indices (including Jaccard's) are already implemented in the `vegan` package. I think they tend to be pretty well-optimized for speed, too. — David J. Harris, Apr 28 '13 at 07:50

ttnphns · Answer 1 · 2013-04-28T12:18:48.360

We know that Jaccard (computed between any two columns of binary data $\bf{X}$) is $\frac{a}{a+b+c}$, while Rogers-Tanimoto is $\frac{a+d}{a+d+2(b+c)}$, where

a - number of rows where both columns are 1
b - number of rows where this and not the other column is 1
c - number of rows where the other and not this column is 1
d - number of rows where both columns are 0

$a+b+c+d=n$, the number of rows in $\bf{X}$

Then we have:

$\bf X'X=A$ is the square symmetric matrix of $a$ between all columns.

$\bf (not X)'(not X)=D$ is the square symmetric matrix of $d$ between all columns ("not X" is converting 1->0 and 0->1 in X).

So, $\frac{\bf A}{n-\bf D}$ is the square symmetric matrix of Jaccard between all columns.

$\frac{\bf A+D}{\bf A+D+2(n-(A+D))}=\frac{\bf A+D}{2n-\bf A-D}$ is the square symmetric matrix of Rogers-Tanimoto between all columns.

I checked numerically if these formulas give correct result. They do.

Upd. You can also obtain matrices $\bf B$ and $\bf C$:

$\bf B= [1]'X-A$, where "[1]" denotes matrix of ones, sized as $\bf X$. $\bf B$ is the square asymmetric matrix of $b$ between all columns; its element ij is the number of rows in $\bf X$ with 0 in column i and 1 in column j.

Consequently, $\bf C=B'$.

Matrix $\bf D$ can be also computed this way, of course: $n \bf -A-B-C$.

Knowing matrices $\bf A, B, C, D$, you are able to calculate a matrix of any pairwise (dis)similarity coefficient invented for binary data.

Fractions make no sense for matrices unless they commute: multiplying on the right by an inverse will otherwise give a different result than multiplying on the left. Moreover, it usually is not the case that a product of two symmetric matrices is symmetric. Do you perhaps mean component-by-component division? Could you fix up your notation to reflect what you intend is the correct formula? — whuber, Feb 07 '13 at 07:19
@whuber I don't use inversion nor multiplication of _square symmetric_ matrices. X is the binary data matrix and X'X is its SSCP matrix. `not X` is X where 1->0, 0->1. And any division here is elementwise division. Please correct my notation if you see it is not appropriate. — ttnphns, Feb 07 '13 at 07:29
@user4959, I don't know R. [Here](http://www.statmethods.net/management/operators.html) !X is recommended; however the result is boolean TRUE/FALSE, not numeric 1/0. Note that I updated my answer where I say that there is also another way to arrive at D matrix. — ttnphns, Apr 28 '13 at 07:21

score 9 · Answer 2 · edited Sep 08 '14 at 15:57

The above solution is not very good if X is sparse. Because taking !X will make a dense matrix, taking huge amount of memory and computation.

A better solution is to use formula Jaccard[i,j] = #common / (#i + #j - #common). With sparse matrixes you can do it as follows (note the code also works for non-sparse matrices):

library(Matrix)
jaccard <- function(m) {
    ## common values:
    A = tcrossprod(m)
    ## indexes for non-zero common values
    im = which(A > 0, arr.ind=TRUE)
    ## counts for each row
    b = rowSums(m)

    ## only non-zero values of common
    Aim = A[im]

    ## Jacard formula: #common / (#i + #j - #common)
    J = sparseMatrix(
          i = im[,1],
          j = im[,2],
          x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
          dims = dim(A)
    )

    return( J )
}

score 1 · Answer 3 · edited Oct 08 '16 at 04:45

This may or may not be useful to you, depending on what your needs are. Assuming that you're interested in similarity between clustering assignments:

The Jaccard Similarity Coefficient or Jaccard Index can be used to calculate the similarity of two clustering assignments.

Given the labelings L1 and L2, Ben-Hur, Elisseeff, and Guyon (2002) have shown that the Jaccard index can be calculated using dot-products of an intermediate matrix. The code below leverages this to quickly calculate the Jaccard Index without having to store the intermediate matrices in memory.

The code is written in C++, but can be loaded into R using the sourceCpp command.

/**
 * The Jaccard Similarity Coefficient or Jaccard Index is used to compare the
 * similarity/diversity of sample sets. It is defined as the size of the
 * intersection of the sets divided by the size of the union of the sets. Here,
 * it is used to determine how similar to clustering assignments are.
 *
 * INPUTS:
 *    L1: A list. Each element of the list is a number indicating the cluster
 *        assignment of that number.
 *    L2: The same as L1. Must be the same length as L1.
 *
 * RETURNS:
 *    The Jaccard Similarity Index
 *
 * SIDE-EFFECTS:
 *    None
 *
 * COMPLEXITY:
 *    Time:  O(K^2+n), where K = number of clusters
 *    Space: O(K^2)
 *
 * SOURCES:
 *    Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon (2001) A stability based
 *    method for discovering structure in clustered data. Biocomputing 2002: pp.
 *    6-17. 
 */
// [[Rcpp::export]]
NumericVector JaccardIndex(const NumericVector L1, const NumericVector L2){
  int n = L1.size();
  int K = max(L1);

  int overlaps[K][K];
  int cluster_sizes1[K], cluster_sizes2[K];

  for(int i = 0; i < K; i++){    // We can use NumericMatrix (default 0) 
    cluster_sizes1[i] = 0;
    cluster_sizes2[i] = 0;
    for(int j = 0; j < K; j++)
      overlaps[i][j] = 0;
  }

  //O(n) time. O(K^2) space. Determine the size of each cluster as well as the
  //size of the overlaps between the clusters.
  for(int i = 0; i < n; i++){
    cluster_sizes1[(int)L1[i] - 1]++; // -1's account for zero-based indexing
    cluster_sizes2[(int)L2[i] - 1]++;
    overlaps[(int)L1[i] - 1][(int)L2[i] - 1]++;
  }

  // O(K^2) time. O(1) space. Square the overlap values.
  int C1dotC2 = 0;
  for(int j = 0; j < K; j++){
    for(int k = 0; k < K; k++){
      C1dotC2 += pow(overlaps[j][k], 2);
    }
  }

  // O(K) time. O(1) space. Square the cluster sizes
  int C1dotC1 = 0, C2dotC2 = 0;
  for(int i = 0; i < K; i++){
    C1dotC1 += pow(cluster_sizes1[i], 2);
    C2dotC2 += pow(cluster_sizes2[i], 2);
  }

  return NumericVector::create((double)C1dotC2/(double)(C1dotC1+C2dotC2-C1dotC2));
}

Calculating Jaccard or other association coefficient for binary data using matrix multiplication

3 Answers3

Linked