I have a dataset of 266 observations on categorical variables, each with multiple categories. I am using the function hclust in R, with the function daisy (Gower's distance) to create the dissimilarity matrix. I have two questions. First, do I need to transform the data into binary variables? I have heard that this is sometimes needed, but nothing like that is mentioned in the R documentation. Second, I am not sure what the method's assumptions are, since I haven't encountered any in the books I have read.

Anna
- No, you should not transform your categorical variables into dummy sets. Gower can process categorical ones as they are. https://stats.stackexchange.com/q/15287 – ttnphns Sep 02 '19 at 15:07
1 Answer
This probably depends on the problem you're solving. Based on my limited reading, I don't think Gower's distance requires a binary transformation; it works for continuous variables as well.
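A minimal sketch of this in R, assuming the categorical variables are stored as unordered factors. The toy data frame, its column names, the average linkage, and the cut at three clusters are all made up for illustration and are not taken from the question's data.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data: three unordered categorical variables as factors
df <- data.frame(
  colour = factor(rep(c("red", "green", "blue"), length.out = 20)),
  shape  = factor(rep(c("round", "square"), length.out = 20)),
  size   = factor(rep(c("S", "M", "L", "XL"), length.out = 20))
)

# Gower dissimilarity computed directly on the factors, with no dummy coding
d <- daisy(df, metric = "gower")

# Agglomerative clustering on the dissimilarity object
# (average linkage is an arbitrary choice for this sketch)
hc <- hclust(d, method = "average")

# Cut the tree into an illustrative number of clusters
groups <- cutree(hc, k = 3)
table(groups)
```

With only nominal factors, each pairwise Gower dissimilarity is the share of variables on which the two observations disagree, so every value lies between 0 and 1.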
As an example, suppose you're recording whether a protein is present or absent in a cell. The experiment measures how much protein appears, and the data look like: 0.3, 0.2, 0, 0, 0, 0.5
Then a binary transformation is expected, because we only care about protein presence/absence, not quantity. Without it, 0.5 and 0.2 (both present) would be treated as more dissimilar than 0.2 and 0 (present vs. absent).
In contrast, suppose you're measuring car speeds: 30, 20, 7, 7, 7, 50.
Then a binary transformation is not necessary, because we care about the speeds themselves, not whether a car is moving or not.
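The protein example can be checked with a few lines of R. This is only a sketch: it uses plain absolute differences from dist() rather than Gower's distance, and the variable names are invented for illustration.

```r
# Protein measurements from the example above
protein <- c(0.3, 0.2, 0, 0, 0, 0.5)

# Raw values: the distance reflects the difference in quantity
raw <- as.matrix(dist(protein))
raw[6, 2]  # 0.5 vs 0.2, about 0.3
raw[2, 3]  # 0.2 vs 0,   about 0.2

# Presence/absence coding: 1 if any protein was measured, 0 otherwise
present <- as.numeric(protein > 0)
bin <- as.matrix(dist(present))
bin[6, 2]  # both present, distance 0
bin[2, 3]  # present vs absent, distance 1
```

On the raw scale the two cells that both contain the protein are further apart than a cell with the protein and a cell without it; after the presence/absence coding the ordering is reversed.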

Hallo
- Thank you for your answer. If you also have any literature to suggest on the assumptions of hierarchical agglomerative clustering, that would be great. – Anna Sep 02 '19 at 13:24
- Gower is explained in detail in this Q&A: https://stats.stackexchange.com/questions/15287/hierarchical-clustering-with-mixed-type-data-what-distance-similarity-to-use – mdewey Sep 02 '19 at 16:07