Measures of closeness between distributions, clusterings, data sets or other objects.
Questions tagged [similarities]
460 questions
50
votes
6 answers
Percentage of overlapping regions of two normal distributions
I was wondering: given two normal distributions with parameters $\sigma_1,\ \mu_1$ and $\sigma_2,\ \mu_2$,
how can I calculate the percentage of overlap between the two distributions?
I suppose this problem has a specific name; are you aware of any…

Ali Salehi
- 603
- 1
- 6
- 5
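For the overlap question above, one quantity that is often meant is the overlapping coefficient, the integral of $\min(f_1(x), f_2(x))$ over the real line. The following is only a minimal numerical sketch under that assumption; the function names, the $\pm 8\sigma$ integration range and the grid size are illustrative choices, not something stated in the question.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_coefficient(mu1, sigma1, mu2, sigma2, n=20000):
    """Approximate the integral of min(f1, f2) with the trapezoidal rule.

    Assumption: +/- 8 standard deviations around each mean covers
    essentially all of the probability mass.
    """
    lo = min(mu1 - 8 * sigma1, mu2 - 8 * sigma2)
    hi = max(mu1 + 8 * sigma1, mu2 + 8 * sigma2)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        y = min(normal_pdf(x, mu1, sigma1), normal_pdf(x, mu2, sigma2))
        # Endpoints get half weight in the trapezoidal rule.
        total += y if 0 < i < n else 0.5 * y
    return total * h

ovl = overlap_coefficient(0.0, 1.0, 2.0, 1.0)
```

When the two standard deviations are equal, the result can be checked against the closed form $2\,\Phi\!\left(-|\mu_1-\mu_2| / (2\sigma)\right)$, where $\Phi$ is the standard normal CDF.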
47
votes
2 answers
Hierarchical clustering with mixed type data - what distance/similarity to use?
In my dataset we have both continuous and naturally discrete variables. I want to know whether we can do hierarchical clustering using both types of variables. If so, what distance measure is appropriate?

Beta
- 5,784
- 9
- 33
- 44
33
votes
1 answer
Comparing hierarchical clustering dendrograms obtained by different distances & methods
[The initial title "Measurement of similarity for hierarchical clustering trees" was later changed by @ttnphns to better reflect the topic]
I am performing a number of hierarchical cluster analyses on a dataframe of patient records (e.g. similar to…

Wouter
- 2,102
- 3
- 17
- 26
28
votes
1 answer
Converting similarity matrix to (euclidean) distance matrix
In the random forest algorithm, Breiman (the author) constructs a similarity matrix as follows:
Send all learning examples down each tree in the forest
If two examples land in the same leaf, increment the corresponding element in the similarity matrix by 1
Normalize…

Uros K
- 467
- 1
- 6
- 9
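The leaf-counting steps in the random forest question above can be sketched as follows. This is an illustration only: the input `leaf_ids` (which leaf each example reaches in each tree) is a hypothetical structure, the excerpt's normalization step is cut off, so dividing by the number of trees is an assumption here, and $\sqrt{1 - s}$ is just one commonly used way to turn such a similarity into a dissimilarity.

```python
import math

def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] is the leaf that example i reaches in tree t (hypothetical input)."""
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Count the trees in which examples i and j land in the same leaf.
            same = sum(1 for t in range(n_trees) if leaf_ids[i][t] == leaf_ids[j][t])
            # Assumed normalization: divide by the number of trees.
            prox[i][j] = same / n_trees
    return prox

def to_distance(prox):
    """One common conversion (an assumption, not the only option): d = sqrt(1 - s)."""
    return [[math.sqrt(max(0.0, 1.0 - s)) for s in row] for row in prox]

leaf_ids = [[1, 2, 3], [1, 2, 4], [5, 6, 7]]
prox = proximity_matrix(leaf_ids)
dist = to_distance(prox)
```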
26
votes
2 answers
Similarity Coefficients for binary data: Why choose Jaccard over Russell and Rao?
From the Encyclopedia of Statistical Sciences I understand that given $p$ dichotomous (binary: 1=present; 0=absent) attributes (variables), we can form a contingency table for any two objects $i$ and $j$ of a sample:
j
1 0
-------
…

wflynny
- 455
- 1
- 6
- 10
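To make the two coefficients in the question above concrete, here is a small sketch using the usual cell labels of the $2 \times 2$ table: $a$ (both present), $b$ and $c$ (present in one object only), $d$ (both absent). Jaccard is $a/(a+b+c)$ and Russell and Rao is $a/(a+b+c+d) = a/p$. The function names are illustrative, and returning 0 when $a+b+c=0$ is an arbitrary convention chosen for this sketch.

```python
def binary_counts(x, y):
    """Return the 2x2 contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # present in both
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # present in x only
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # present in y only
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # absent in both
    return a, b, c, d

def jaccard(x, y):
    a, b, c, _ = binary_counts(x, y)
    # Joint absences (d) are ignored; 0.0 for the empty case is a convention.
    return a / (a + b + c) if (a + b + c) > 0 else 0.0

def russell_rao(x, y):
    a, b, c, d = binary_counts(x, y)
    # Joint absences count in the denominator but not in the numerator.
    return a / (a + b + c + d)

x = [1, 1, 0, 0, 0]
y = [1, 0, 1, 0, 0]
```

With these two vectors the counts are $a=1$, $b=1$, $c=1$, $d=2$, so the only difference between the coefficients is whether the two joint absences enter the denominator.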
25
votes
5 answers
Compute a cosine dissimilarity matrix in R
I want to create heatmaps based upon cosine dissimilarity.
I'm using R and have explored several packages, but cannot find a function to generate a standard cosine dissimilarity matrix. The built-in dist() function doesn't support cosine distances,…

Greg Slodkowicz
- 405
- 1
- 5
- 10
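The question above is about R; purely as a language-neutral illustration of the computation, a cosine dissimilarity matrix can be taken as $1 - \cos(x_i, x_j)$ for every pair of rows. The sketch below is written in Python, assumes no row is all zeros, and uses a function name made up for this example.

```python
import math

def cosine_dissimilarity_matrix(rows):
    """Pairwise 1 - cosine similarity between rows (assumes no all-zero row)."""
    norms = [math.sqrt(sum(v * v for v in r)) for r in rows]
    n = len(rows)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            out[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return out

rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
d = cosine_dissimilarity_matrix(rows)
```

Rows pointing in the same direction (here the first and third) get dissimilarity 0 regardless of their length, which is the main difference from Euclidean distance.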
22
votes
5 answers
Similarity measures between curves?
I would like to compute a measure of similarity between two ordered sets of points: the ones under User compared with the ones under Teacher.
The points are curves in 3D space, but I was thinking that the problem is simplified if I plotted them…

Alex
- 321
- 1
- 3
- 4
21
votes
4 answers
Euclidean distance score and similarity
I'm just working with the book Collective Intelligence (by Toby Segaran) and came across the Euclidean distance score. In the book the author shows how to calculate the similarity between two recommendation arrays (i.e. $\textrm{person} \times…

navige
- 325
- 1
- 2
- 6
17
votes
1 answer
What are the differences between the Dice, Jaccard, and overlap coefficients?
I have come across three different statistical measures for comparing two sets, in particular for image segmentation (e.g., comparing the similarity between the ground truth and the segmented result).
What are the differences between these measurements…

RockTheStar
- 11,277
- 31
- 63
- 89
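For reference, the three set measures named in the question above have standard definitions: Dice is $2|A \cap B| / (|A| + |B|)$, Jaccard is $|A \cap B| / |A \cup B|$, and the overlap coefficient is $|A \cap B| / \min(|A|, |B|)$. A minimal sketch, assuming non-empty sets (the function names are illustrative):

```python
def dice(a, b):
    """Dice coefficient: twice the intersection over the sum of the sizes."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard_index(a, b):
    """Jaccard index: intersection over union."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a, b):
    """Overlap coefficient: intersection over the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

# In segmentation these sets could be, e.g., the pixel indices labelled foreground.
truth = {1, 2, 3, 4}
predicted = {3, 4, 5}
```

Dice and Jaccard are monotonically related by $D = 2J / (1 + J)$, so they rank pairs of sets identically; the overlap coefficient differs in that it equals 1 whenever one set is contained in the other.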
17
votes
3 answers
Can someone please explain dynamic time warping for determining time series similarity?
I am trying to grasp the dynamic time warping measure for comparing time series. I have three time series datasets like this:
T1 <- structure(c(0.000213652387565, 0.000535045478866, 0, 0, 0.000219346347883,
0.000359669104424,…

Legend
- 4,232
- 7
- 37
- 50
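As a rough aid for the dynamic time warping question above, the classic formulation fills a cumulative cost table in which each cell adds the local cost to the cheapest of its three predecessor cells. The sketch below assumes the absolute difference as the local cost and applies no window constraint; it is written in Python rather than the R of the excerpt, and `dtw_distance` is a name chosen for this example.

```python
def dtw_distance(s, t):
    """Unconstrained DTW between two numeric sequences with |x - y| as local cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning s[:i] with t[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # repeat t[j-1] against a new point of s
                cost[i][j - 1],      # repeat s[i-1] against a new point of t
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]

same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The example pair differs only by a repeated value, so the warping path absorbs the repetition and the distance is zero, whereas a pointwise comparison would not even be defined for series of different lengths.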
16
votes
1 answer
What is the optimal distance function for individuals when attributes are nominal?
I do not know which distance function between individuals to use in case of nominal (unordered categorical) attributes.
I was reading some textbooks: some suggest the simple matching function, but others suggest that I should change the nominal to…

Jane Doe
- 311
- 1
- 2
- 6
15
votes
4 answers
What is the purpose of row normalization?
I understand the reasoning behind column normalization, as it causes features to be weighted equally even if they are not measured on the same scale. However, in the nearest neighbour literature, both columns and rows are often normalized. What is…

curiosity_delivers
- 173
- 1
- 1
- 8
15
votes
3 answers
Quantifying similarity between two data sets
Summary: Trying to find the best method to summarize the similarity between two aligned data sets using a single value.
Details:
My question is best explained with a diagram. The graphs below show two different data sets, each with values…

Gabriel Southern
- 271
- 1
- 2
- 8
12
votes
3 answers
Distance Metrics For Binary Vectors
I have vectors of the same length consisting of 1s and 0s. I am trying to find out how similar they are. So far I am using the Hamming distance: I calculate the sum of one vector, then the sum of the second vector, and the difference between these is the difference of…

totpiko
- 241
- 1
- 2
- 3
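For comparison with the procedure described in the question above, the Hamming distance is usually defined position by position, as the number of coordinates at which two equal-length vectors differ. A minimal sketch (the function name is illustrative):

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for u, v in zip(x, y) if u != v)

x = [1, 0, 1, 0]
y = [0, 1, 0, 1]
positionwise = hamming_distance(x, y)
sum_difference = abs(sum(x) - sum(y))
```

The two example vectors disagree in every position yet contain the same number of 1s, so a comparison based only on the vector sums cannot distinguish them.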
12
votes
2 answers
Does Mercer's theorem work in reverse?
A colleague has a function $s$, and for our purposes it is a black box. The function measures the similarity $s(a,b)$ of two objects.
We know for sure that $s$ has these properties:
The similarity scores are real numbers between 0 and 1,…

Sycorax
- 76,417
- 20
- 189
- 313