I'm trying to find meaningful clusters in a set of objects represented by a vector space (bag-of-words) model. Each of the 5000 objects has 1-8 features ("words") out of 5500 possible. I used binary vectors ($A_i = 1$ if feature $i$ is present, 0 otherwise) and cosine distance as the dissimilarity measure, $d(A, B) = \sqrt{2 - 2 \cos(A, B)}$, where $$\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2 \sum_i B_i^2}}$$
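To make the definition concrete, here is a minimal sketch of this dissimilarity in plain Python rather than R. The toy feature sets and the name `cosine_distance` are made up for illustration. Each object is stored as a set of feature ids, which is equivalent to a 0/1 vector.

```python
import math

def cosine_distance(a, b):
    """d(A, B) = sqrt(2 - 2 cos(A, B)) for two non-empty sets of feature ids."""
    # For 0/1 vectors, sum(A_i * B_i) is the number of shared features
    # and sum(A_i^2) is simply the number of features present.
    shared = len(a & b)
    cos = shared / math.sqrt(len(a) * len(b))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))

# Hypothetical toy objects, each a set of feature ids.
objects = [{1, 2, 3}, {2, 3}, {7}, {8, 9}]

d_shared = cosine_distance(objects[0], objects[1])    # two features in common
d_disjoint = cosine_distance(objects[0], objects[2])  # nothing in common
```

Any two objects with no shared feature have cosine 0 and therefore land at exactly $\sqrt{2}$, whatever else they contain.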
No matter whether I apply R's `pam` or `hclust` / `agnes` (with `cutree(k = K0)`) to the dissimilarity matrix, I get one big, degenerate cluster (thousands of members) and several small ones, unless I raise the number of clusters to many hundreds (about 10% of the number of objects). I think one problem is that most pairs of objects have no features in common and thus sit at the maximal distance. What can I try?
Update: For each object, I computed the fraction of all objects lying at less than the maximal distance ($\sqrt{2}$). The summary is below: one object has 13% of all objects closer than $\sqrt{2}$, while the median object has only about 0.5%:
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    0.0003766 0.0015070 0.0054610 0.0163400 0.0222200 0.1298000
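For reference, a summary like this can be computed without building the full dissimilarity matrix, because with binary features two objects are closer than $\sqrt{2}$ exactly when they share at least one feature. The sketch below is in plain Python rather than R and uses made-up toy data. The function name and the choice of denominator (all other objects, excluding the object itself) are assumptions, not necessarily what produced the numbers above.

```python
from collections import defaultdict

def neighbor_fractions(objects):
    """For each object, the fraction of the other objects sharing >= 1 feature."""
    # Inverted index: feature id -> indices of the objects that have it.
    index = defaultdict(set)
    for i, feats in enumerate(objects):
        for f in feats:
            index[f].add(i)

    n = len(objects)
    fractions = []
    for i, feats in enumerate(objects):
        neighbors = set()
        for f in feats:
            neighbors |= index[f]
        neighbors.discard(i)  # an object is not its own neighbor
        fractions.append(len(neighbors) / (n - 1))
    return fractions

# Hypothetical toy objects, each a set of feature ids.
objects = [{1, 2, 3}, {2, 3}, {7}, {8, 9}, {3, 9}]
fractions = neighbor_fractions(objects)
```

In this toy set the object `{7}` has no neighbors at all, which is the situation the low quartiles above suggest is typical of the real data.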