
I have a data set consisting of $n$ elements with $d$ features per element ($x_{i,f}$ denotes the value of the $f$-th feature of the $i$-th element). I would like to cluster this data set into $k$ clusters.

One problem I have is that one feature is a nominal one and another is a discrete ordinal one. As an example consider elements which have the following features:

  • $x_{i,1}$ : the height of a person
  • $x_{i,2}$ : the weight of a person
  • $x_{i,3}$ : country where the person lives
  • $x_{i,4}$ : nr. of friends the person has

Is it OK to use a simple k-means algorithm with a Euclidean distance measure?

I would introduce an indicator variable $\delta$ with the following meaning for feature $x_{i,3}$: $$\delta(i,j) = \begin{cases} 0, & \text{if }x_{i,3} = x_{j,3}\text{,}\\ 1, & \text{else.}\end{cases}$$ So two objects $i$ and $j$ have distance 0 for their third feature if this feature is the same (same country) and distance 1 otherwise.
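A minimal sketch of this combined distance, assuming the four example features above (squared Euclidean terms for the numeric features plus the 0/1 indicator for country); the feature ordering and sample values are made up for illustration:

```python
import math

def mixed_distance(x, y):
    """Euclidean distance over numeric features plus a 0/1 indicator
    for the nominal country feature (index 2)."""
    numeric = [0, 1, 3]  # height, weight, nr. of friends
    d_num = sum((x[f] - y[f]) ** 2 for f in numeric)
    d_cat = 0 if x[2] == y[2] else 1  # the indicator delta(i, j)
    return math.sqrt(d_num + d_cat)

a = (180.0, 75.0, "DE", 10)
b = (170.0, 80.0, "FR", 12)
print(mixed_distance(a, b))  # sqrt(100 + 25 + 4 + 1)
```

Note that without scaling, the numeric features (e.g. height in cm) will dominate the 0/1 country term, so some normalisation would likely be needed in practice.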

Or do you know a better way to do a cluster analysis in this case?

user2653422
    It is not theoretically sane to do k-means on binary ("present" vs "absent") data because means are meaningless for data which cannot be thought of as continuous. You should compute Gower similarity and then do hierarchical or k-medoids clustering. – ttnphns Jun 05 '14 at 16:21
  • If I understand it correctly, the Gower similarity leads to a value between 0 and 1, and it also uses something like the indicator variable I described. The only difference between k-medoids and k-means seems to be that k-medoids uses a real data object of a cluster as centroid. Do I miss something, or is my described solution a valid one if I switch from k-means to k-medoids? – user2653422 Jun 05 '14 at 16:38
  • Yes. About Gower, please read http://stats.stackexchange.com/a/15313/3277 – ttnphns Jun 05 '14 at 19:04
  • Exactly: k-means assumes that "0.212313" is a reasonable representative of your data. Which, for binary indicator variables, does not make sense. Hierarchical clustering, PAM/k-medoids, DBSCAN, ... - use an algorithm that works with arbitrary similarity measures + a distance suitable for your data (Gower is worth a try). – Has QUIT--Anony-Mousse Jun 06 '14 at 12:34
  • Thanks to both of you. One more question: what if I just calculate the hypothetical centroid and then use the real object of the cluster which is closest to this calculated centroid as the actual centroid? So pretty much like k-medoids, but hopefully with faster computation. I should mention that my data set is huge (several hundred MBs), so the method used should not be too slow (less than 10 mins) and shouldn't need GBs of RAM. I'm not sure about the capabilities of the mentioned algorithms in these respects. – user2653422 Jun 06 '14 at 13:36
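A sketch of the Gower-plus-hierarchical approach suggested in the comments, on tiny made-up data: numeric features contribute range-normalised absolute differences, the nominal country feature contributes the 0/1 indicator, and the averaged dissimilarities feed SciPy's hierarchical clustering. All values and the choice of average linkage are illustrative assumptions, not a prescription:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: height, weight, nr. of friends (numeric) and country (nominal)
X_num = np.array([[180, 75, 10],
                  [170, 80, 12],
                  [160, 55, 3],
                  [165, 60, 4]], dtype=float)
X_cat = np.array(["DE", "FR", "DE", "DE"])

rng = X_num.max(axis=0) - X_num.min(axis=0)  # per-feature ranges
n = len(X_num)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d_num = np.abs(X_num[i] - X_num[j]) / rng  # range-normalised
        d_cat = 0.0 if X_cat[i] == X_cat[j] else 1.0  # indicator term
        D[i, j] = D[j, i] = (d_num.sum() + d_cat) / 4  # average over 4 features

# Hierarchical clustering on the precomputed Gower-style dissimilarities
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

For large data sets the full $n \times n$ distance matrix is the bottleneck; that memory concern is exactly why the "snap the computed centroid to its nearest real object" shortcut is tempting, though it loses k-medoids' guarantee of optimising the chosen dissimilarity directly.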
