I know a similar question already exists on Cross Validated, but mine is somewhat different:

Clustering of mixed type data with R

In that Q&A, the suggestion is to use the daisy() function, which lets us include categorical variables in clustering.

But I'm wondering: if the levels of a nominal variable are coded in sequence (for example, 1 is a small apartment, 2 is a middle-size apartment, 3 is a building, and the higher the number, the better), can I use k-means clustering with this nominal variable?

Of course, in this case the nominal variable is converted to int type (i.e., treated as continuous).

Please let me know why this can or cannot be done. I would like a theoretical explanation.

ttnphns
서영재
  • So you just want to convert a nominal variable to continuous? Or something more? – carlo Mar 21 '17 at 14:26
  • @carlo Yes, right. But the nominal variable has a rank. For example, 1 is a small apartment, 2 is a middle-size apartment, 3 is a building, and the higher the number, the better. I know that converting a nominal variable to continuous is wrong, so I ranked the nominal variable's values. Is that right? – 서영재 Mar 24 '17 at 00:41
  • Your example seems to be ordinal rather than nominal. In any case, daisy works fine for what you want to do; I used the Matlab port of it for my master's thesis. Just be careful to tell it accurately which variable is which type. – David Ernst Sep 03 '17 at 21:17

1 Answer

It depends on the desired effect.

For example, with k-means, if you encode these values as 1, 2, 3, the squared distance from 1 to 3 is 2² = 4, i.e., 4 times as much as the squared distance from 1 to 2 or from 2 to 3 (1² = 1).

This can be desired or problematic. It depends on your data's meaning; there is no single mathematically 'more correct' way.
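To make the arithmetic concrete, here is a minimal sketch in plain Python (the question is about R, so treat this as an illustration only). It computes the squared distances that k-means would see for the 1, 2, 3 coding from the question. The second coding, with the top level set to 4, is purely hypothetical and is there only to show that the result depends on the numbers you choose; the helper names are made up for this example.

```python
# Integer codes from the question; the level names are only labels.
codes = {"small apartment": 1, "middle-size apartment": 2, "building": 3}


def squared_distance(a, b):
    """Squared Euclidean distance between two one-dimensional values."""
    return (a - b) ** 2


def pairwise_squared_distances(coding):
    """Return {(level_a, level_b): squared distance} for every pair of levels."""
    levels = list(coding)
    return {
        (a, b): squared_distance(coding[a], coding[b])
        for i, a in enumerate(levels)
        for b in levels[i + 1:]
    }


# Coding from the question: equal steps of 1 between adjacent levels.
equal_steps = pairwise_squared_distances(codes)

# Hypothetical alternative coding (an assumption, not from the question):
# the top level is pushed further away from the other two.
wider_top = pairwise_squared_distances({**codes, "building": 4})

for pair, d in equal_steps.items():
    print("codes 1,2,3:", pair, d)
for pair, d in wider_top.items():
    print("codes 1,2,4:", pair, d)
```

With the 1, 2, 3 coding, the two extreme levels are 4 times as far apart (in squared distance) as adjacent levels; with the hypothetical 1, 2, 4 coding that ratio becomes 9. Whether either one reflects the real differences between the categories is a question about the data, not about the algorithm.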

Has QUIT--Anony-Mousse