
I am looking for a significance test for the Jaccard Distance (JD).

As an example, I have two datasets as follows:

Baseline: $|A \cap B| = 57;\ |A \cup B| = 275 \quad \therefore\ JD = 0.7927$

Evaluation: $|A \cap B| = 126;\ |A \cup B| = 433 \quad \therefore\ JD = 0.7090$

Is there a way to determine whether the JD at evaluation is significantly different from the baseline?

Or do I simply use the classical z-test of proportions? The z-test assumes that there is a significant difference from the baseline.
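For reference, the z-test of proportions mentioned above can be sketched in Python as follows. This is only an illustration under a strong assumption that the question does not state: each element of the union is treated as an independent Bernoulli trial that lands in the intersection with some fixed probability, so that $|A \cap B| / |A \cup B|$ behaves like a binomial proportion. The test's null hypothesis is that the two proportions are equal. The function names are hypothetical.

```python
import math


def jaccard_distance(intersection, union):
    """Jaccard distance computed from intersection and union sizes."""
    return 1.0 - intersection / union


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    ASSUMPTION: x1 of n1 and x2 of n2 are counts of independent
    Bernoulli trials, which may not hold for set-overlap data.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Counts taken from the question: baseline, then evaluation.
jd_base = jaccard_distance(57, 275)
jd_eval = jaccard_distance(126, 433)
z, p = two_proportion_z_test(57, 275, 126, 433)
print(f"JD baseline = {jd_base:.4f}, JD evaluation = {jd_eval:.4f}")
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

Whether the independence assumption is reasonable depends on how the sets A and B were generated, which is the crux of the question.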

Ferdi
Mari153

2 Answers


I found an article by Real and Vargas (1996) that describes the Jaccard index from a probabilistic perspective: The Probabilistic Basis of Jaccard's Index of Similarity. A few years later, tables of significance values (Table 3) were published in: Tables of significant values of Jaccard's index of similarity. Although these papers describe how to determine whether J is significant, they may not directly answer your question. However, the statistical approach given in Real and Vargas (1996) may help in deriving an appropriate methodology.

Btw, I would not recommend a z-test: I sometimes obtain significant results even for small differences between means because the standard deviations are small. To me, this test seems a bit overoptimistic and should not be applied when using (in my case) cross-validation, bootstrapping, or similar approaches to assess the stability of estimates.

Martin D.

The Jaccard distance/index/coefficient (also known as the Tanimoto index/coefficient) is a popular measure of similarity/dissimilarity between binary data. We can directly compute the statistical significance of the Jaccard index/coefficient using an R package, jaccard, available on CRAN.

Please note that the Jaccard distance is complementary to the Jaccard coefficient (see Wikipedia), so you can subtract the Jaccard coefficient from 1 to get the Jaccard distance.

Given two binary input vectors, the package's main function, jaccard.test(), computes a p-value. If your data are too large, the exact solution (accessed through method = "exact") could be slow, and you may want to use a fast and accurate estimation instead (accessed through method = "mca").
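To illustrate the general idea of testing a Jaccard index between two binary vectors, here is a minimal Python sketch of a plain permutation test. It is not the algorithm implemented in the jaccard package and does not reproduce jaccard.test(); it only shows one common way to obtain a p-value, by shuffling one vector to simulate independence. The function names and the example vectors are hypothetical.

```python
import random


def jaccard_index(x, y):
    """Jaccard index of two equal-length binary sequences."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """One-sided permutation p-value for 'more similar than chance'.

    Shuffling y breaks any association with x while keeping the
    number of ones in each vector fixed.
    """
    rng = random.Random(seed)
    observed = jaccard_index(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if jaccard_index(x, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 100 positions, 20 ones each, 15 shared.
x = [1] * 20 + [0] * 80
y = [1] * 15 + [0] * 80 + [1] * 5
observed, p_value = permutation_p_value(x, y)
print(f"Jaccard index = {observed:.3f}, permutation p = {p_value:.4f}")
```

The Jaccard distance follows as 1 minus the observed index. The smallest p-value this sketch can report is 1 / (n_perm + 1), so more permutations are needed to resolve very small p-values.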

  • An anonymous user contributed the following comment: "This statistical test and p-value estimation methods are explained in the recent publication in BMC Bioinformatics: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3118-5." – whuber Dec 28 '19 at 22:01