I have a data set of categorical data where each question can have more than one answer. This is a toy example:

question one: what did you eat today?

subject 1 :  Potatoes, Apples

subject 2 : Apples

subject 3 : Honey, Potatoes, Apples

question two: which TV shows do you watch?

subject 1 : House, Dexter

subject 2 : TWD, Dexter

subject 3 : The news

I need a similarity measure between responses, for example distfun(Potatoes and Apples, Honey and Apples). I don't think that ordinary multivariate categorical analysis measures deal with this kind of response. Can someone shed some light on which measures to use, ideally by referring me to an explanatory paper?

Thanks in advance

EDIT: 1. The order of the responses is not important. 2. I am not just looking for a way to solve the issue ASAP by using a modified version of the "overlap" distance that is used for multivariate categorical data, but to find the framework that deals with this kind of data. Thanks for your responses.

Usobi
  • What would such a distance function look like, even theoretically? Perhaps it would be good if you tell us what you need this distance function for; that is, what are your hypotheses/research questions? – Peter Flom Apr 29 '13 at 10:24
  • The data set is a number of linguistic attributes of certain towns. Each linguistic attribute has an arbitrary number of responses per town. – Usobi Apr 29 '13 at 10:31
  • Let's take as an example the so-called "overlap" measure in multivariate categorical data. How can I apply something similar here, and where does it appear? Thanks for the help – Usobi Apr 29 '13 at 10:33
  • Well, you could count the number of terms in common, perhaps weighting by the frequency of such an overlap, but that's just an idea, purely on my intuition. Maybe someone else will have a better answer. – Peter Flom Apr 29 '13 at 10:34
  • You have "multiple response" question(s). So, format your data as binary variables: each response category is a variable (e.g. a "potatoes" var, an "apples" var) and the values are 1 (selected) or 0 (not selected). Then choose among the many binary association measures (see e.g. [here](http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fsyn_proximities_measures_binary_data.htm)). – ttnphns Apr 29 '13 at 11:44
  • Yes! Exactly, that kind of analysis is the one I was looking for. Now I can program this and pick the association measure that I consider good. Many thanks! – Usobi Apr 29 '13 at 12:58
  • In case you would enjoy programming the calculation of such coefficients yourself instead of using software, look [here](http://stats.stackexchange.com/q/49453/3277). – ttnphns Apr 29 '13 at 13:23
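
The binary recoding suggested in the comments can be sketched in a few lines of Python. This is only an illustration on the toy data from the question: the names `to_indicators`, `jaccard` and `simple_matching` are hypothetical helpers, and Jaccard and simple matching are just two of the many binary measures one could pick, not ones named in the thread.

```python
# Toy answers to question one from the post; each response is a set of items.
responses = {
    "subject 1": {"Potatoes", "Apples"},
    "subject 2": {"Apples"},
    "subject 3": {"Honey", "Potatoes", "Apples"},
}

# One binary variable per response category, in a fixed (sorted) order.
categories = sorted(set().union(*responses.values()))


def to_indicators(answer_set):
    """Return a 0/1 vector: 1 if the category was selected, else 0."""
    return [1 if c in answer_set else 0 for c in categories]


indicators = {s: to_indicators(a) for s, a in responses.items()}


def jaccard(u, v):
    """Jaccard similarity: joint selections / categories selected by either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    # Assumption: two empty responses are treated as identical.
    return both / either if either else 1.0


def simple_matching(u, v):
    """Simple matching: share of categories on which the two vectors agree."""
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)


print(categories)
for s, row in indicators.items():
    print(s, row)
print(jaccard(indicators["subject 1"], indicators["subject 3"]))
print(simple_matching(indicators["subject 1"], indicators["subject 2"]))
```

Jaccard ignores categories that neither subject selected, while simple matching counts joint absences as agreement; which behaviour is appropriate depends on whether a shared non-selection is informative for the data at hand.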

1 Answer

If the order of the answers does not matter, you could count the number of common items, so that e.g. d = dist(P and A, H and A) = 1. (This measures similarity rather than distance; you may invert it, i.e. use 1/d, to get the opposite direction, though 1/d is undefined when no items are shared.) If the order matters (e.g. if it is a prioritized list), you could use something like the Hamming distance or the Kendall distance.
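
For the unordered case, the count of common items is just the size of a set intersection. A minimal sketch follows, assuming each response is given as a list of item names. `common_items` and `differing_items` are hypothetical names, and the second function is a swapped-in alternative to inverting the count: it counts the items chosen by exactly one of the two subjects, which stays defined when nothing is shared.

```python
def common_items(a, b):
    """Similarity: number of items that appear in both responses."""
    return len(set(a) & set(b))


def differing_items(a, b):
    """Distance (an assumption, not the 1/d from the answer):
    number of items that appear in exactly one of the two responses."""
    return len(set(a) ^ set(b))


first = ["Potatoes", "Apples"]
second = ["Honey", "Apples"]

print(common_items(first, second))     # 1, the value of d in the answer
print(differing_items(first, second))  # 2 (Potatoes and Honey)
```

Because the responses are converted to sets, the order of the items and any repeated item have no effect on either value.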

Simen Gaure
  • Thanks, indeed the order does not matter. I could use a measure like the one you proposed, but one of my main problems is that I do not know which framework I am in. It is not, stricto sensu, multivariate categorical data, and I would like to get more info about this. – Usobi Apr 29 '13 at 10:51