
Suppose I have a dataset of feature vectors, each representing a person. Each feature vector contains attributes of mixed types (e.g. sex, age, height, hair color, favourite film, ...).

For example, these could be some instances:

    ["Male", 28, 1.72, "brown", "star wars rebels", ...],
    ["Female", 33, 1.65, "blonde", "seven pounds", ...],
    ["Female", 19, 1.60, "blonde", "star trek", ...],
    ["Male", 37, 1.84, "black", "star wars return of the jedi", ...],
    ...

I want to cluster these people to find groups of similar people. There are several types of attributes: sex is a binary attribute, age is a discrete value, height is a continuous value, and hair color is a categorical (aka nominal) attribute (i.e. it has a finite set of values). The last attribute, however, is a free-text string with an unbounded set of possible values. I found nothing in the literature about handling string values.

I'm not looking for one particular algorithm, because I want to experiment with several. Most clustering algorithms work only with numeric attributes. Categorical attributes can therefore be a problem, but there are methods to represent them in numeric form (e.g. 1-of-n, aka one-hot encoding). The real issue is that I don't understand how to handle string attributes: 1-of-n encoding doesn't help, because nearly every string is distinct and would get its own column. I searched, but didn't find anything.
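To illustrate the difference, here is a minimal sketch in plain Python using the four instances above. The helper names `one_hot_encode` and `squared_distance` are my own, made up for illustration, and not from any library. The point is that 1-of-n encoding gives a compact representation for hair color, while for the film titles every value becomes its own column and every pair of people ends up equally far apart.

```python
def one_hot_encode(values):
    """Hypothetical helper: 1-of-n encode a list of categorical values."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


hair = ["brown", "blonde", "blonde", "black"]
films = [
    "star wars rebels",
    "seven pounds",
    "star trek",
    "star wars return of the jedi",
]

hair_categories, hair_vectors = one_hot_encode(hair)
film_categories, film_vectors = one_hot_encode(films)

# Hair color: 3 columns for 4 people, and the two blondes coincide.
print(hair_categories, hair_vectors)

# Films: one column per person, so all pairwise distances are identical.
film_distances = {
    squared_distance(film_vectors[i], film_vectors[j])
    for i in range(len(films))
    for j in range(i + 1, len(films))
}
print(len(film_categories), film_distances)
```

With this encoding the two Star Wars titles are exactly as dissimilar as any other pair of films, so the attribute carries no grouping information.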

Maybe one approach would be to use some hash function to "convert" the strings into numbers? I think this could lead to a wrong clustering (depending also on the hash function).
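As a rough sketch of what I mean by the hash idea, the snippet below maps each film title to a number between 0 and 1. The function `hash_to_unit_interval` and the choice of MD5 are assumptions made only for this illustration. I use `hashlib` rather than the built-in `hash()`, because the latter is salted per process for strings and would not be reproducible.

```python
import hashlib


def hash_to_unit_interval(text):
    """Hypothetical helper: map a string to a number between 0 and 1.

    Takes the first 8 bytes of the MD5 digest as an unsigned integer
    and scales it by 2**64.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


films = [
    "star wars rebels",
    "seven pounds",
    "star trek",
    "star wars return of the jedi",
]

hashed = {film: hash_to_unit_interval(film) for film in films}

# Print the titles ordered by hash value to inspect the induced ordering.
for film, value in sorted(hashed.items(), key=lambda item: item[1]):
    print(f"{value:.4f}  {film}")
```

A hash function is designed to scatter its inputs, so the numeric gap between two titles says nothing about how alike the titles are. Any clustering on this column would depend on the hash function and not on the films.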

ttnphns
RobotMan
  • Please look for `clustering mixed type variables` on this site. – ttnphns Sep 14 '16 at 07:34
  • Have you seen this one? http://stats.stackexchange.com/questions/130974/how-to-use-both-binary-and-continuous-variables-together-in-clustering – MFR Sep 14 '16 at 07:45
  • The Gower distance isn't suitable: it can't handle string attributes. For nominal attributes, it uses the Dice coefficient, which is calculated by recoding them into dummy variables. This approach isn't possible with a "free text" attribute. – RobotMan Sep 14 '16 at 08:22
  • I am not aware of anything that would work and yield meaningful results. Whatever you do, I doubt you can give any *statistical* guarantees that it is good. – Has QUIT--Anony-Mousse Sep 17 '16 at 18:01
  • I think this problem needs to be approached by 1) defining what a good clustering is, only then 2) checking which algorithms find good clusterings. You gain nothing if you just manage to run some algorithms, if they don't solve the problem. – Has QUIT--Anony-Mousse Sep 17 '16 at 18:02

0 Answers