Gower distance with R functions; "gower.dist" and "daisy"

Question

I have 9 numeric and 5 binary (0-1) variables, with 73 samples in my dataset. I know that the Gower distance is a good metric for datasets with mixed variables.

I tried both the daisy (package cluster) and gower.dist (package StatMatch) functions. Both functions accept weights, and I assigned them as follows: a weight of 5 for the numeric attributes and 1 for the binary ones.

But they give different distance matrices. Shouldn't they give the same results?
These are my features and the first sample:

A    B      C   D   E   F   G   H    I       J       K   L       M       N  
800 1200    0   0   0   0   1   2   0.31    0.33    0.1 0.62    0.35    0.44

A: Numeric (square feet)
B: Numeric (dollars)
C, D, E, F, G: Binary (yes/no)
H: Numeric (number of children)
I, J, K, L, M, N: Numeric (percent)

cdeterman · Accepted Answer · 2014-11-13T17:32:16.660

They in fact do give the same results. I am not sure how you are comparing them but here is an example:

# Create example data
set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create numeric variables
vars <- as.matrix(replicate(17, rnorm(30)))
df <- data.frame(nom, vars)

library(cluster)
daisy.mat <- as.matrix(daisy(df, metric="gower"))

library(StatMatch)
gower.mat <- gower.dist(df)

# you can look directly to see the numbers are the same
head(daisy.mat, 3)
head(gower.mat, 3)

# yet identical() returns FALSE. Why?
identical(daisy.mat, gower.mat)
[1] FALSE

# Because the values returned by the two functions
# differ by extremely small floating-point amounts
max(abs(daisy.mat - gower.mat))
[1] 5.551115e-17

# all.equal() compares within a numeric tolerance instead
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
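
The same effect can be reproduced without either package. The following is a minimal base R sketch (the values 0.1, 0.2 and 0.3 are chosen only for illustration) of why identical() is too strict for comparing floating-point results while all.equal() is not:

```r
# Two ways of arriving at "0.3" that differ in the last binary digits
a <- 0.1 + 0.2
b <- 0.3

# Exact, bit-for-bit comparison
exact_match <- identical(a, b)

# Comparison within a numeric tolerance (about 1.5e-8 by default)
close_match <- isTRUE(all.equal(a, b))

print(exact_match)  # FALSE
print(close_match)  # TRUE
print(abs(a - b))   # a tiny non-zero difference
```

all.equal() is wrapped in isTRUE() because it returns a character description of the difference, rather than FALSE, when the values are not close enough.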

Now that I understand you are passing an extra argument to the daisy function, there is still a solution, and it lies in the documentation for gower.dist. The key point, stated near the beginning of that documentation, is that columns of mode logical are treated as asymmetric binary variables. So you need to make sure your binary columns are stored as logical.

set.seed(123)
# create nominal variable
nom <- factor(rep(letters[1:3], each=10))
# create binary variables
bin <- as.matrix(replicate(5, sample(c(0, 1), 30, replace = TRUE)))
# create numeric variables
vars <- as.matrix(replicate(9, rnorm(30)))
df <- data.frame(nom, bin, vars)

# You can see that the binary columns are numeric, not logical,
# so we need to convert them
str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : num  0 1 0 1 1 0 1 1 1 0 ...
     $ X2  : num  1 1 1 1 0 0 1 0 0 0 ...
     $ X3  : num  1 0 0 0 1 0 1 1 1 0 ...
     $ X4  : num  0 1 0 1 0 0 1 0 0 1 ...
     $ X5  : num  1 0 0 0 0 1 0 0 0 1 ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# make the binary columns logical
df[, 2:6] <- lapply(df[, 2:6], function(x) x == 1)

# now the columns have the correct types
str(df)
'data.frame':   30 obs. of  15 variables:
     $ nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
     $ X1  : logi  FALSE TRUE FALSE TRUE TRUE FALSE ...
     $ X2  : logi  TRUE TRUE TRUE TRUE FALSE FALSE ...
     $ X3  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
     $ X4  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
     $ X5  : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
     $ X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ...
     $ X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ...
     $ X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ...
     $ X4.1: num  0.298 0.637 -0.484 0.517 0.369 ...
     $ X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ...
     $ X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
     $ X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
     $ X8  : num  0.134 0.221 1.641 -0.219 0.168 ...
     $ X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...


# now you can make your calls
daisy.mat <- as.matrix(daisy(df, metric = "gower", type = list(asymm = 2:6)))
gower.mat <- gower.dist(df)

# and you can see that the results are the same
all.equal(daisy.mat, gower.mat, check.attributes = FALSE)
[1] TRUE
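
To see why the asymmetric setting changes the numbers, here is a hand-rolled sketch of the Gower dissimilarity for a single pair of rows. It is not the code of either package: gower_pair, the sample values, and the ranges are all made up for illustration. It assumes the usual Gower definition, where numeric differences are scaled by the variable's range and, for asymmetric binary variables, a variable on which both rows are FALSE is left out of the average.

```r
# Illustrative helper (hypothetical, not part of cluster or StatMatch):
# Gower dissimilarity between two rows with numeric and binary variables.
gower_pair <- function(num_i, num_j, ranges, bin_i, bin_j, asymmetric = TRUE) {
  # Numeric variables: absolute difference scaled by the variable's range
  d_num <- abs(num_i - num_j) / ranges
  # Binary variables: 0 if the values agree, 1 if they differ
  d_bin <- as.numeric(bin_i != bin_j)
  # delta marks whether a variable takes part in the comparison;
  # asymmetric binary variables drop out when both values are FALSE
  delta_num <- rep(1, length(d_num))
  delta_bin <- if (asymmetric) as.numeric(bin_i | bin_j) else rep(1, length(d_bin))
  sum(c(d_num, d_bin) * c(delta_num, delta_bin)) / sum(delta_num, delta_bin)
}

# Two made-up rows: two numeric variables and three binary variables
num_i  <- c(800, 2)
num_j  <- c(1000, 4)
ranges <- c(400, 4)            # assumed ranges over the whole data set
bin_i  <- c(FALSE, FALSE, TRUE)
bin_j  <- c(FALSE, TRUE,  TRUE)

d_asym <- gower_pair(num_i, num_j, ranges, bin_i, bin_j, asymmetric = TRUE)
d_sym  <- gower_pair(num_i, num_j, ranges, bin_i, bin_j, asymmetric = FALSE)
print(c(asymmetric = d_asym, symmetric = d_sym))
```

The two results differ because the first binary variable (FALSE in both rows) counts as a match in the symmetric case but is ignored in the asymmetric one. That is the kind of discrepancy that appears when one function treats the 0-1 columns as asymmetric and the other does not.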

Thank you, Dr. Determan. Yes, they gave the same results if I do not specify in the "daisy" function that my binary variables are asymmetric. So "daisy" treated them as interval. As I understand it, we cannot specify variable types in the "gower.dist" function. - Emrah Bilgiç, Nov 13 '14 at 01:55
Dr. Determan, you can see my data below my question. I cannot specify my binary variables in the "gower.dist" function, I think. First code for gower.dist function: > gower.dist daisy - Emrah Bilgiç, Nov 13 '14 at 17:13
Dr. Determan, as I understand it, both functions automatically perform the required standardization for my mixed-type variables. - Emrah Bilgiç, Nov 13 '14 at 17:16
@EmrahBilgiç, see above, you can in fact specify binary variables with `gower.dist` - cdeterman, Nov 13 '14 at 17:32

score 1 · Answer 2 · answered Jul 18 '17 at 20:49

Yes, they give the same result, as cdeterman has shown.

One difference I want to mention: "gower.dist" effectively uses equal weights (what its documentation calls weights can only be 0 or 1), whereas "daisy" lets you pass your own weight vector through the 'weights' argument.
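
As a rough sketch of the 'weights' argument of "daisy", the following uses a small made-up data frame (homes and its columns area, price and garage are hypothetical) and the 5-versus-1 weighting described in the question. It assumes the cluster package is installed.

```r
library(cluster)

# Made-up data: two numeric variables and one binary variable
homes <- data.frame(
  area   = c(800, 1000, 1200, 900),
  price  = c(1200, 1500, 2000, 1600),
  garage = c(FALSE, TRUE, TRUE, FALSE)
)

# One weight per column: 5 for each numeric variable, 1 for the binary one
w <- c(5, 5, 1)

# Gower dissimilarities with the binary column declared asymmetric,
# once with the default equal weights and once with the custom weights
d_equal    <- as.matrix(daisy(homes, metric = "gower", type = list(asymm = 3)))
d_weighted <- as.matrix(daisy(homes, metric = "gower", type = list(asymm = 3),
                              weights = w))

print(round(d_equal, 3))
print(round(d_weighted, 3))
```

With the larger weights on the numeric columns, a disagreement on the binary column contributes less to the overall dissimilarity than it does under equal weights.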

Conclusion: if you want a more flexible way to calculate the Gower dissimilarity, I recommend "daisy" from the "cluster" package. If your main interest is building a synthetic dataset, use "gower.dist"; it will save you a lot of time, since you can use "NND.hotdeck" directly.
