Can categorical variables be used in hierarchical clustering? I have heard that only continuous variables are used, but I have also seen people discuss whether categorical variables may or may not be used. Can anyone provide insight?

gung - Reinstate Monica

Windstorm1981
- Yes, of course: categorical data are frequently a subject of cluster analysis, especially hierarchical. Many proximity measures exist for binary variables (including the sets of dummy variables derived from categorical variables), as do entropy measures. Clusters of cases will be the frequent combinations of attributes, and each measure weighs those frequencies in its own way. One problem with clustering categorical data is the stability of solutions. And [this](http://stats.stackexchange.com/q/218604/3277) recent question raises the issue of variable correlation. – ttnphns Jun 22 '16 at 21:22
- Search this site for `hierarchical clustering categorical` to read related threads. – ttnphns Jun 22 '16 at 21:44
- Possible duplicate of [Clustering of mixed type data with R](https://stats.stackexchange.com/questions/24540/clustering-of-mixed-type-data-with-r) – kjetil b halvorsen Oct 30 '18 at 17:31
- I don't think this is a duplicate, exactly. The linked question is about R, and might even be off-topic now. This question is about statistics and doesn't mention a software package. – Peter Flom Oct 31 '18 at 11:57
- @ttnphns: do you want to post your comment(s) as an answer? [Better to have a short answer than no answer at all.](https://stats.meta.stackexchange.com/a/5326/1352) Anyone who has a better answer can post it. – Stephan Kolassa Apr 27 '19 at 20:53
- [This article](http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/#case-2-clustering-on-categorical-data) shows how to perform cluster analysis on categorical variables by using principal component methods to convert the categorical variables into continuous ones. – Electromagnet Apr 14 '20 at 09:25
1 Answer
Yes, of course: categorical data are frequently a subject of cluster analysis, especially hierarchical. Many proximity measures exist for binary variables (including the sets of dummy variables derived from categorical variables), as do entropy measures. Clusters of cases will be the frequent combinations of attributes, and each measure weighs those frequencies in its own way. One problem with clustering categorical data is the stability of solutions. And [this](http://stats.stackexchange.com/q/218604/3277) recent question raises the issue of variable correlation.
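
To make the dummy-variable route concrete, here is a minimal sketch in Python with NumPy and SciPy. Everything in it is an assumption chosen for illustration rather than something the original comment prescribes: the toy data are made up, `dummy_code` is a hypothetical helper, and the Jaccard distance with average linkage is just one of the many possible measure/linkage combinations.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up toy data: 6 cases described by 3 categorical variables.
data = np.array([
    ["red",  "small", "round"],
    ["red",  "small", "round"],
    ["red",  "small", "square"],
    ["blue", "large", "square"],
    ["blue", "large", "square"],
    ["blue", "large", "round"],
])


def dummy_code(column):
    """Turn one categorical column into a boolean dummy (one-hot) matrix."""
    levels, codes = np.unique(column, return_inverse=True)
    return np.eye(len(levels), dtype=bool)[codes]


# One dummy set per categorical variable, stacked side by side.
dummies = np.hstack([dummy_code(data[:, j]) for j in range(data.shape[1])])

# Pairwise Jaccard distances between cases on the binary dummies
# (illustrative choice; other binary proximity measures are possible).
distances = pdist(dummies, metric="jaccard")

# Agglomerative clustering with average linkage (also an illustrative choice).
tree = linkage(distances, method="average")

# Cut the tree into at most two clusters and print each case's label.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

On this toy data the first three cases share most attribute values with each other, as do the last three, so the two-cluster cut separates them. With real data it is worth rerunning with other measures and linkages, since the solution can be unstable, as noted above.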

mkt
- I've copied this comment by @ttnphns as a community wiki answer because the comment is, more or less, an answer to this question. We have a dramatic gap between answers and questions. At least part of the problem is that some questions are answered in comments: if comments which answered the question were answers instead, we would have fewer unanswered questions. – mkt Aug 27 '19 at 16:00