I have a seemingly simple question that is nonetheless troubling me a bit.

I have pairs of vectors made up of nominal attributes. The two vectors in a pair can have different lengths, and some attribute values in one might not appear in the other. See a and b below as two examples.

               a
1  mathematician
2       engineer
3  mathematician
4  mathematician
5  mathematician
6       engineer
7  mathematician
8  mathematician
9  mathematician
10 mathematician
11 mathematician
12      engineer
13 mathematician
14 mathematician
15      engineer

               b
1      physicist
2        surgeon
3      physicist
4        surgeon
5      physicist
6      physicist
7        surgeon
8        surgeon
9      physicist
10     physicist
11 mathematician

Do you have in mind a measure that could summarize the dissimilarity between them? I am looking for something like the Euclidean distance, but for qualitative vectors.

One option I have in mind is to transform the categorical vectors into frequency vectors and compute the Euclidean distance between those. That way they would become quantitative and have the same length. Is this a sound approach?
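To make the idea concrete, here is a minimal sketch in Python. It assumes that "frequency vector" means relative frequencies (proportions) over the union of the categories in both vectors, so that the different lengths of a and b do not dominate the result. The function names `relative_frequencies` and `frequency_distance` are made up for illustration.

```python
from collections import Counter
from math import sqrt

# The two example vectors from the question.
a = ["mathematician", "engineer", "mathematician", "mathematician",
     "mathematician", "engineer", "mathematician", "mathematician",
     "mathematician", "mathematician", "mathematician", "engineer",
     "mathematician", "mathematician", "engineer"]
b = ["physicist", "surgeon", "physicist", "surgeon", "physicist",
     "physicist", "surgeon", "surgeon", "physicist", "physicist",
     "mathematician"]


def relative_frequencies(values, categories):
    """Return the proportion of each category in `values`, in a fixed order."""
    counts = Counter(values)
    n = len(values)
    # Counter returns 0 for categories that never occur in `values`.
    return [counts[c] / n for c in categories]


def frequency_distance(x, y):
    """Euclidean distance between the relative-frequency vectors of x and y."""
    # Assumption: both vectors are non-empty.
    categories = sorted(set(x) | set(y))  # shared category order
    fx = relative_frequencies(x, categories)
    fy = relative_frequencies(y, categories)
    return sqrt(sum((p - q) ** 2 for p, q in zip(fx, fy)))


print(frequency_distance(a, b))
```

Note that this treats all categories as equally dissimilar and ignores the order of the entries, so two vectors with the same proportions get a distance of 0 regardless of their lengths.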

If someone has more ideas, we could do a review of distance measures for nominal vectors!

Riccardo
  • If all categories are equally dissimilar to each other (a mathematician is no more like a physicist or an engineer than a surgeon), then for a given total count, I think you're only left with functions of counts of mismatches. – Glen_b Feb 11 '14 at 21:49
  • @Glen_b Thank you. Do you have in mind some specific count functions? – Riccardo Feb 11 '14 at 22:05
  • The most obvious thing is the proportion of mismatched entries, but there are many other measures around. – Glen_b Feb 11 '14 at 22:13
  • If those are two _vectors_, as you say (i.e. a one-to-one correspondence between their rows exists), then they _must_ be of the same length before [a (dis)similarity](http://stats.stackexchange.com/q/55798/3277) can be computed. If, on the other hand, they are two _sets_, then it is another story entirely and you should base the measure on counts. – ttnphns Feb 12 '14 at 04:43
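
The proportion of mismatched entries mentioned in the comments could be sketched as below. This assumes the two inputs really are paired vectors of equal length; the function name `mismatch_proportion` and the toy data are made up for illustration.

```python
def mismatch_proportion(x, y):
    """Share of positions at which two paired nominal vectors disagree."""
    if len(x) != len(y):
        # Position-wise comparison only makes sense for paired entries.
        raise ValueError("vectors must have the same length")
    if not x:
        raise ValueError("vectors must be non-empty")
    return sum(u != v for u, v in zip(x, y)) / len(x)


# Hypothetical paired vectors, not the a and b from the question
# (those have different lengths and so cannot be compared this way).
x = ["engineer", "surgeon", "physicist", "engineer"]
y = ["engineer", "physicist", "physicist", "surgeon"]
print(mismatch_proportion(x, y))  # 2 of 4 positions differ
```

For inputs of unequal length, such as a and b in the question, this function raises an error, which matches the point that a position-wise measure is only defined for paired entries.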
