Distance between independent observations of a categorical variable

Question

I have a random variable $T$ that takes values in $\{ \text{blue}, \text{green}, \text{red} \}$, and a number of observations of $T$:

|i  |T     |
|:--|:-----|
|1  |red   |
|2  |red   |
|3  |green |
|4  |red   |
|5  |blue  |
|6  |blue  |
|7  |green |
|8  |red   |
|9  |green |

or a matrix that looks like:

|  i| T_blue| T_green| T_red|
|--:|------:|-------:|-----:|
|  1|      0|       0|     1|
|  2|      0|       0|     1|
|  3|      0|       1|     0|
|  4|      0|       0|     1|
|  5|      1|       0|     0|
|  6|      1|       0|     0|
|  7|      0|       1|     0|
|  8|      0|       0|     1|
|  9|      0|       1|     0|

What is the distance between any two rows of this matrix?

I know that, in general, the Jaccard index is recommended for binary data. Is it still meaningful when the columns are mutually exclusive, so that each row contains exactly one 1? In that case it reduces to an indicator for concordance: $\operatorname{Jaccard}(T_i,T_j) \in \{0, 1\}\ \forall\ i,j$. Is this nonstandard? Is there a better metric I should use?
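To make the reduction concrete, here is a minimal sketch in plain Python that one-hot encodes the nine observations from the table and checks that the Jaccard distance (one minus the Jaccard index) between any two rows equals the discrete 0/1 distance between the underlying categories. The helper names `one_hot`, `jaccard_distance`, and `discrete_distance` are made up for this illustration and do not come from any library.

```python
from itertools import combinations

# The nine observations of T from the table above.
observations = ["red", "red", "green", "red", "blue",
                "blue", "green", "red", "green"]
levels = sorted(set(observations))  # ['blue', 'green', 'red']


def one_hot(value):
    """Return the dummy-coded row for a single category."""
    return [1 if value == level else 0 for level in levels]


def jaccard_distance(u, v):
    """1 - |intersection| / |union| for two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1 - both / either  # 'either' is at least 1 for one-hot rows


def discrete_distance(s, t):
    """0 if the categories agree, 1 otherwise."""
    return 0 if s == t else 1


rows = [one_hot(t) for t in observations]

# Compare the two distances on every pair of rows.
agree = all(
    jaccard_distance(rows[i], rows[j])
    == discrete_distance(observations[i], observations[j])
    for i, j in combinations(range(len(rows)), 2)
)
print(agree)
```

Two identical one-hot rows share their single 1, so the ratio is 1/1 and the distance is 0. Two different rows share nothing and have a union of size 2, so the distance is 1.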

Background

I would like to find the distance correlation between a categorical variable $T$ and a continuous variable $X$. As far as I can tell, this is conceptually valid because the distance correlation measures the deviation of the joint distribution $f_{T,X}$ from the product of the marginals $f_T f_X$, which does not depend on what kind of values $T$ and $X$ take. Moreover, it is defined in terms of pairwise distances "within" each variable, so it doesn't run into the messy issue of explicitly defining a distance between a categorical variable and a continuous one.

Do any special considerations arise for my use case?
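As a rough sketch of what that computation could look like, the following plain-Python code computes the sample distance correlation from double-centered pairwise distance matrices, using the discrete 0/1 distance for $T$ and the absolute difference for $X$. The values in `x_demo` are invented purely for illustration (no observations of $X$ are given here), and the function names are hypothetical rather than taken from any package.

```python
import math


def distance_matrix(values, dist):
    """Pairwise distances within one variable."""
    return [[dist(a, b) for b in values] for a in values]


def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    n = len(d)
    row_means = [sum(row) / n for row in d]
    col_means = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_means) / n
    return [[d[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n)] for i in range(n)]


def mean_product(a, b):
    """Average of the elementwise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y, dist_x, dist_y):
    """Sample distance correlation from within-variable distances."""
    a = double_center(distance_matrix(x, dist_x))
    b = double_center(distance_matrix(y, dist_y))
    dcov2 = mean_product(a, b)                      # squared distance covariance
    dvar2 = mean_product(a, a) * mean_product(b, b)  # product of squared variances
    if dvar2 == 0:
        return 0.0  # one variable is constant
    # max() guards against tiny negative values from rounding.
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2))


def discrete(s, t):
    return 0.0 if s == t else 1.0


def abs_diff(u, v):
    return abs(u - v)


t_obs = ["red", "red", "green", "red", "blue",
         "blue", "green", "red", "green"]
# Hypothetical X values, invented for this example only.
x_demo = [2.1, 1.9, 0.3, 2.4, -1.0, -1.2, 0.1, 2.0, 0.4]

dcor = distance_correlation(t_obs, x_demo, discrete, abs_diff)
print(dcor)
```

One point that may bear on the question: the discrete distance between two categories equals the Euclidean distance between their dummy-coded rows divided by $\sqrt{2}$, and the distance correlation is unchanged when all distances within a variable are rescaled by a constant. Under that reading, using the discrete metric for $T$ gives the same result as the usual Euclidean distance correlation applied to the dummy matrix.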

Is the order of observations meaningful? If not, you can regard the data as a one-dimensional contingency table. — Kodiologist, Mar 10 '15 at 20:26
@Kodiologist the order is only meaningful insofar as the observations of $X$ and $T$ have the same "case id," but that isn't important because it gets washed out in computing the distance covariance. But I'm afraid I don't see how the contingency table interpretation helps here. — shadowtalker, Mar 10 '15 at 21:03
When you compare cases by a single set of dummy variables, Jaccard (as well as most other similarity measures for binary data) can obviously take on only the values 1 and 0, so what's the use of it? — ttnphns, Mar 10 '15 at 21:59
@ttnphns well, it becomes a measure of concordance. My question is whether concordance-as-distance makes sense for computing distance covariance. — shadowtalker, Mar 10 '15 at 22:26
@ssdecontrol Then since the order doesn't matter, a perfectly good measure of distance is the difference of the logits of the proportions, no? That comes directly from the contingency-table interpretation. But perhaps I don't understand what question you're trying to answer with distances and distance correlations. — Kodiologist, Mar 11 '15 at 00:05
@Kodiologist the proportion for each i is always 1/N where N is the number of unique values of T — shadowtalker, Mar 11 '15 at 00:20
@ssdecontrol I now realize that you asked about distance between rows rather than distance between columns. Sorry, my mistake. In the case of distance between rows of this matrix, the only sensible metric is the discrete metric (1 for rows that are different and 0 for rows that are the same) unless you know something more about the relationship of the possible values of T to each other. So you were right in not being able to find a better metric. — Kodiologist, Mar 11 '15 at 00:36
Came across this old thread. (1) I agree that **the only** sensible distance is $1$ between any two different rows and $0$ otherwise. (2) You seem to have created the [distance-covariance] tag. Would you perhaps consider writing a tag wiki excerpt for it? (3) Your notation and terminology for your random variable $T$ in the first line is strange; I don't think you can write it like that. — amoeba, Oct 14 '15 at 21:51
@amoeba (1) hey, if it's an answer it's an answer. (3) I wish I could tell you what I was thinking in the sleep-deprived depths of writing my thesis. And (2) thanks for the pointer about the wiki. — shadowtalker, Oct 15 '15 at 04:36
