
What clustering method would be appropriate for a study with 16 variables, both categorical and quantitative? It is a repeated-measures observational study. I read in a text that k-means would be best if the data were mostly quantitative, although I may have interpreted that wrong.

Also, which methods are best for discarding certain variables prior to performing cluster analysis? I read articles similar to the study I am analyzing; some opted to do a PCA first, or a discriminant function analysis afterwards to confirm the chosen number of clusters.

Thoughts and comments?

DJ_

1 Answer


k-means needs to be able to compute means. How do you compute means for categorical data?

Similarly, how do you intend to do PCA here? You also need to center the data set for this.

Depending on your data set size, hierarchical clustering with e.g. Gower's distance may work very well. For the numerical attributes, I'd try z-standardization.
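A minimal sketch of this approach, using numpy and scipy: the `gower_matrix` function below is a hand-rolled illustration of Gower's dissimilarity (range-normalized absolute differences for numeric columns, simple matching for categorical columns), not a library routine, and the data are random placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def gower_matrix(num, cat):
    """Pairwise Gower dissimilarity for a mixed data set.

    num: (n, p) array of numeric columns,
    cat: (n, q) array of categorical codes.
    """
    n = num.shape[0]
    # Range-normalize numeric columns so each contributes at most 1.
    col_range = num.max(axis=0) - num.min(axis=0)
    col_range[col_range == 0] = 1.0
    d = np.zeros((n, n))
    for i in range(n):
        # Numeric part: absolute differences scaled by column range.
        d_num = np.abs(num[i] - num).dot(1.0 / col_range)
        # Categorical part: simple matching (0 if equal, 1 otherwise).
        d_cat = (cat[i] != cat).sum(axis=1)
        d[i] = (d_num + d_cat) / (num.shape[1] + cat.shape[1])
    return d

rng = np.random.default_rng(0)
num = rng.normal(size=(20, 3))          # 3 quantitative variables
cat = rng.integers(0, 3, size=(20, 2))  # 2 categorical variables

D = gower_matrix(num, cat)
# Average-linkage hierarchical clustering on the precomputed distances.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
```

With a precomputed dissimilarity matrix like this, the choice of linkage (average, complete, etc.) is independent of how the distances were built.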

For larger data sets, you could still try e.g. DBSCAN and OPTICS - if you have solved the challenge of measuring similarity (which you also need for hierarchical clustering!). That is actually the first step for you to figure out: can you quantify how similar (or dissimilar) two instances are?
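For instance, scikit-learn's DBSCAN accepts a precomputed dissimilarity matrix, so any mixed-data measure (such as Gower's) can be plugged in. A small sketch with placeholder data; for illustration the matrix here is plain Euclidean distance, and `eps`/`min_samples` are arbitrary choices you would tune for your data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))

# Any precomputed dissimilarity matrix works here, e.g. one built
# with Gower's distance for mixed data. For illustration, plain
# Euclidean distances:
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

db = DBSCAN(eps=1.5, min_samples=3, metric="precomputed").fit(D)
labels = db.labels_  # -1 marks noise points
```

The key point stands regardless of algorithm: once you can quantify dissimilarity between two instances, hierarchical clustering, DBSCAN, and OPTICS all become available.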

Has QUIT--Anony-Mousse
  • How do you compute means for categorical data - convert them into quantitative dummy variables first. Some form of appropriate scaling may be necessary in this case, otherwise categorical variables with more categories will get greater weight. – deemar Feb 13 '13 at 09:23
  • 1
    @deemar That is just a crude workaround. What good is a mean in such transformed variable(s)? It's essentially meaningless. Yes, you can then compute things. But are they any good? Plus, the discrete nature of the values will cause additional problems, and you totally lose understanding of what the algorithm really does measure. – Has QUIT--Anony-Mousse Feb 13 '13 at 09:49
  • @deemar, You might like also to look [here](http://stats.stackexchange.com/questions/40613/why-dont-dummy-variables-have-the-continuous-adjacent-category-problem-in-clust) – ttnphns Feb 13 '13 at 11:09
  • @Anony-Mousse agreed, it is a crude work around, but not without use. If you use a 0 and 1 coding for the dummy variable, then the mean of that dummy variable for a specific cluster represents the proportion of observations in that cluster that have the category label that dummy variable codes for. Admittedly, the distance defined from applying a euclidean distance to the coding has less meaning than a distance metric that specifically allows for categorical variables. – deemar Mar 04 '13 at 08:03
  • Again, that is the theory of dummy variables. The actual clustering result will have a value there, but that may still not reflect useful clusters. Variance is optimized by 0 or 1, so chances are that you end up just doing some randomized feature selection this way. – Has QUIT--Anony-Mousse Mar 04 '13 at 08:46
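The observation in the comments above, that the mean of a 0/1 dummy variable is the proportion of observations carrying that category label, can be checked in a couple of lines (the data here is a made-up example):

```python
import numpy as np

colors = np.array(["red", "blue", "red", "green", "red"])

# 0/1 dummy variable for the category "red".
dummy = (colors == "red").astype(float)

# The mean of a 0/1 dummy is the proportion of "red" observations:
# here 3 out of 5, i.e. 0.6.
proportion = dummy.mean()
```

This makes a cluster centroid on dummy variables interpretable as a vector of category proportions, but, as noted above, it does not make the Euclidean distances between such codings a well-founded similarity measure for categorical data.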