I'd like to assess how scattered a cluster of binary vectors $X_j$ is. As I understand it, the conventional way to do this is:

$$ S = \frac{1}{T} \sum_{j=1}^{T}\|X_j-A\|_p, $$

where $A$ is the centroid of the cluster and $\|X_j-A\|_p$ is the distance between the centroid and the individual vector $X_j$.

So my question is how to compute $A$ for Gower distances. If there's an existing R implementation I could use, that would also be great.

ttnphns
a11msp
  • Your question is not clear enough. (1) By "centroid", do you mean what the word usually means: the multivariate arithmetic mean? Or some other sort of centre (which one, then)? (2) Does `||` in your notation just indicate that the distance between the point and the centroid is _squared_? And which distance is that: Euclidean distance? (Note that "Gower distance" is conventionally defined between data points, not between a data point and some centre.) – ttnphns Sep 27 '14 at 09:10
  • Info about the Gower coefficient that you might find helpful: http://stats.stackexchange.com/a/15313/3277 – ttnphns Sep 27 '14 at 09:13
  • Thanks for asking for clarification. Perhaps my question would be best phrased as: what is the best measure of centre and distance from it to compute the tightness of a cluster of binary vectors? – a11msp Sep 27 '14 at 12:20
  • You may ask your new question or edit this one accordingly, if you like. One possible answer might then be: if the data are truly categorical for you, so that the idea of "underlying" continuous traits is unwelcome, then the cluster can't have any "centre" inside it. Its multivariate _mode_ will express its "central tendency". – ttnphns Sep 27 '14 at 12:30
  • You might also explain _why_ you need to know the "centre" of a cloud of points in binary space. Do you really need it? – ttnphns Sep 27 '14 at 12:32
  • Thanks. I may not need to know the centre, but I need some kind of statistic to compute the tightness of a cluster. The reason for this, in turn, is to be able to perform a permutation test on it, to say how non-random a cluster this tight is (compared to just a random subset of the data), and also how the distribution of cluster tightnesses observed under some conditions differs from that under others. – a11msp Sep 27 '14 at 12:44
  • PS. It's not me (or a computer), it's nature that performed the clustering. – a11msp Sep 27 '14 at 12:46
  • One possible measure of cluster tightness (homogeneity) when the features are nominal is **entropy**. To compute it: 1) for each category of a feature (in your case, each feature has 2 categories), compute the proportion of objects in this cluster that fall in that category; 2) multiply the proportion by its logarithm; 3) sum these products across all the categories and flip the sign. That is the cluster's entropy for the current feature. 4) Sum the entropies across all features. The smaller the quantity, the tighter the cluster. – ttnphns Sep 27 '14 at 14:18
  • Thanks, I actually started thinking about entropy too. Will check it out. – a11msp Sep 27 '14 at 17:39
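
The entropy recipe from the comment above can be sketched in a few lines. This is only an illustration, not code from the thread: the function name `cluster_entropy` is made up, the natural logarithm is an assumption (the comment does not fix a base), and a category with proportion 0 is taken to contribute 0.

```python
import numpy as np

def cluster_entropy(X):
    """Sum of per-feature entropies for a cluster of binary vectors.

    X: array-like of shape (n_objects, n_features) with 0/1 entries.
    Natural log is an assumption; any base only rescales the result.
    """
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)              # proportion of ones per feature
    probs = np.stack([p, 1.0 - p])  # both categories of each feature
    terms = np.zeros_like(probs)
    mask = probs > 0                # treat 0 * log(0) as 0
    terms[mask] = probs[mask] * np.log(probs[mask])
    return float(-terms.sum())      # flip the sign, sum over features

# A perfectly tight cluster versus a maximally mixed one
tight = [[1, 0, 1], [1, 0, 1], [1, 0, 1]]
loose = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
print(cluster_entropy(tight), cluster_entropy(loose))
```

Because the statistic is a single number per cluster, it could be recomputed on random subsets of the data of the same size to build the permutation distribution mentioned earlier in the comments.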

1 Answer

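Methods such as PAM (also known as k-medoids) simply choose the most central object of the cluster as its representative, the medoid. This works for arbitrary distance functions, so it will also work for Gower. The medoid is the object with the smallest total distance to all the other objects: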

$$ A := \operatorname*{arg\,min}_{X_j\in D} \sum_{i=1}^{T} d(X_j,X_i) $$
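
As a rough sketch of the formula above (not the answerer's code), the medoid and the mean distance to it can be computed from a pairwise distance matrix. The helper names are hypothetical, and the distance is an assumption: for binary features treated as symmetric, Gower's dissimilarity reduces to the simple matching distance (the fraction of mismatching coordinates), which is what is used here. Any other distance matrix could be substituted.

```python
import numpy as np

def pairwise_simple_matching(X):
    """Fraction of mismatching coordinates between every pair of rows."""
    X = np.asarray(X, dtype=bool)
    # XOR every pair of rows, then average over the feature axis
    return (X[:, None, :] ^ X[None, :, :]).mean(axis=2)

def medoid_scatter(X):
    """Return (index of the medoid, mean distance of all rows to it)."""
    D = pairwise_simple_matching(X)
    medoid = int(np.argmin(D.sum(axis=1)))  # smallest total distance
    return medoid, float(D[medoid].mean())  # S with A set to the medoid

X = np.array([[1, 0, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 1]])
idx, s = medoid_scatter(X)
print(idx, s)
```

Note that the mean includes the medoid's zero distance to itself, matching the $1/T$ normalisation in the question. In R, `daisy()` and `pam()` from the `cluster` package provide Gower dissimilarities and k-medoids respectively.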

Has QUIT--Anony-Mousse