Suppose I have a dataset of feature vectors, each representing a person. Each feature vector contains attributes of mixed types (e.g. sex, age, height, hair color, favourite film, ...).
For example, these could be some instances:
["Male", 28, 1.72, "brown", "star wars rebels", ...],
["Female", 33, 1.65, "blonde", "seven pounds", ...],
["Female", 19, 1.60, "blonde", "star trek", ...],
["Male", 37, 1.84, "black", "star wars return of the jedi", ...],
...
I want to cluster these people to find groups of similar people. There are several types of attributes: sex is a binary attribute, age is a discrete value, height is a continuous value, and hair color is a categorical (aka nominal) attribute (i.e. it has a finite set of values). The last attribute, however, is a free-form string with an unbounded set of possible values. I found nothing in the literature about string values.
I'm not looking for a particular algorithm, because I'm interested in experimenting with several algorithms rather than just one. Many clustering algorithms work only with numeric attributes. Categorical attributes can therefore be a problem, but there are methods to represent them in numeric form (e.g. 1-of-n, aka one-hot encoding). The real issue is that I can't understand how to handle string attributes: obviously I can't use the 1-of-n technique, because the set of possible values is not finite. I searched, but I didn't find anything.
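To make the 1-of-n idea concrete, here is a minimal sketch of what I mean for the hair color attribute. The helper name `one_hot` and the list `hair_colors` are my own assumptions for illustration; the encoding only works because the set of hair colors is finite and known in advance.

```python
# Hypothetical helper, for illustration only.
def one_hot(value, categories):
    """Return a 1-of-n vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


# Assumed: the complete, finite set of hair colors in the dataset.
hair_colors = ["black", "blonde", "brown"]

# Hair color column of the four example instances above.
observed = ["brown", "blonde", "blonde", "black"]
encoded = [one_hot(color, hair_colors) for color in observed]

for color, vector in zip(observed, encoded):
    print(color, vector)
```

Doing the same for the favourite film would need one column per distinct title, and a title not seen before would have no column at all, which is why I don't think this technique applies to the string attribute.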
Maybe one approach would be to use a hash function to "convert" the strings into numbers? I think this solution could lead to a wrong clustering (depending also on the hash function), because similar strings would not necessarily be mapped to nearby numbers.
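This is roughly the hashing idea I have in mind, as a sketch rather than a proposal. The function name `hash_to_number` and the choice of MD5 from Python's `hashlib` are assumptions for illustration; any other hash function would give different numbers.

```python
import hashlib


# Hypothetical helper: map a string to a number in [0, 1) using MD5.
def hash_to_number(text):
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()  # 32 hex digits
    return int(digest, 16) / 16 ** 32


# Favourite film column of the four example instances above.
films = [
    "star wars rebels",
    "seven pounds",
    "star trek",
    "star wars return of the jedi",
]
codes = {film: hash_to_number(film) for film in films}

for film, code in codes.items():
    print(f"{code:.6f}  {film}")
```

The numbers are deterministic, but a general-purpose hash is designed to scatter its inputs, so I would not expect the two "star wars" titles to end up any closer to each other than to "seven pounds". A distance-based clustering algorithm would then be grouping people on what is essentially noise.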