
I'm trying to meaningfully cluster a set of objects characterized by a vector space (bag-of-words) model. Each of those 5000 objects has 1-8 features ("words") from a set of 5500 possible. I used a binary vector space model ($A_i = 1$ if feature $i$ is present) and cosine distance as a dissimilarity measure, $d(A, B) = \sqrt{2 - 2\cos(A, B)}$ where $$\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2 \sum_i B_i^2}}$$
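To make the setup concrete, here is a minimal sketch of how that dissimilarity matrix could be built in R; the tiny matrix `X` is a hypothetical stand-in for the real 5000 × 5500 binary object-by-feature matrix:

```r
## Hypothetical 0/1 object-by-feature matrix (rows = objects, cols = features)
X <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 0,
              0, 0, 0, 1), nrow = 3, byrow = TRUE)

## cos(A, B) = sum(A_i B_i) / sqrt(sum(A_i^2) * sum(B_i^2))
norms  <- sqrt(rowSums(X^2))
cosine <- (X %*% t(X)) / (norms %o% norms)

## d(A, B) = sqrt(2 - 2 cos(A, B)); pmax() absorbs tiny negative rounding errors
D <- as.dist(sqrt(pmax(2 - 2 * cosine, 0)))
```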

No matter whether I apply R's `pam` or `hclust`/`agnes` (with `cutree(k = K0)`) to the dissimilarity matrix, I seem to get one big, degenerate cluster (thousands of members) and several small ones, unless I crank the number of clusters up to many hundreds (roughly 10% of the number of objects). I think one problem is that most objects have no features in common and thus sit at the maximal distance from each other. What can I try?
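For reference, the two approaches being compared look roughly like this (a sketch assuming `D` is the `dist` object built above; `K0 = 2` is just an illustrative value):

```r
library(cluster)              # pam(), agnes()

K0 <- 2                       # illustrative number of clusters

## Partitioning around medoids, directly on the dissimilarities
fit_pam <- pam(D, k = K0, diss = TRUE)
table(fit_pam$clustering)     # cluster sizes: the big degenerate cluster shows up here

## Agglomerative clustering, then cut the dendrogram into K0 groups
fit_hc <- hclust(D, method = "average")
table(cutree(fit_hc, k = K0))
```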

Update: I've summarized, for each object, the fraction of other objects sitting at less than the maximal distance $\sqrt{2}$, and I got this (meaning the best-connected object has 13% of all objects at less than $\sqrt{2}$ distance, while the median object has about 0.5%):

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.0003766 0.0015070 0.0054610 0.0163400 0.0222200 0.1298000 
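That summary can be reproduced roughly like this (a sketch, assuming `D` holds the full dissimilarities; the small tolerance is an assumption to absorb floating-point error):

```r
## Per-object fraction of other objects closer than the maximal distance sqrt(2)
M   <- as.matrix(D)
eps <- 1e-8
frac_close <- (rowSums(M < sqrt(2) - eps) - 1) / (nrow(M) - 1)   # "- 1" drops self
summary(frac_close)
```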
  • When two objects have no features in common, how would you propose to assess their similarity? What exactly would "meaningfully" amount to in your application? This doesn't seem to be a question that can be solved by statistics alone with just the information you have provided. – whuber Apr 21 '14 at 21:50
  • $d(A,B)$ as defined above is a metric. When two objects have no features in common it yields the maximum distance between two normalized vectors, which is $\sqrt{2}$ (since $\cos = 0$) – Anul Apr 21 '14 at 21:52
  • Yes, but there are many metrics on that space to choose from. Yours is a question about which one might be "meaningful" and that comes down to *your* sense of meaningfulness. It would help for you to share some information about that with us. – whuber Apr 21 '14 at 21:55
  • I see. Well, I'd like to get more balanced clusters even when k is small. The objects have been (previously) clustered by hand (using an ad-hoc, implicit taxonomy to produce something like a dendrogram -- files organized in folders and subfolders). Now I have these labels (which come from an automated source) and I was hoping to automate the clustering. I realize that the sparsity of labels can be a problem. – Anul Apr 21 '14 at 22:13
  • Perhaps you could re-cast your question as one of classification: to wit, how to come up with a formula that tends to agree with the *ad hoc* labels. This would open it up to broader and more creative solutions. It could help us to know more about how the *ad hoc* classification works (because I suspect it might rely on more information than you have presented here, such as additional characteristics of the features). – whuber Apr 21 '14 at 22:16
  • Yes, I could use the ad-hoc labels, but I was hoping that a more "objective" clustering would emerge by using the new labels. Of course people's intuition captures more implicit features about the objects -- what's the point? Anyway, I suspect the question is dead. I was hoping for someone to comment on the presence of a single degenerate cluster with the vector space model (is this common, are there better measures that avoid this effect etc) – Anul Apr 22 '14 at 04:12
  • Your distance is euclidean distance [computed from cosine](http://stats.stackexchange.com/a/36158/3277). It is a reasonable choice because almost all known clustering methods can work on or imply euclidean distance. But many of them require _squared_ distances to input. As for your "one big cluster"... - who said that clusters must be of the same size? And why must your data be consisting of clusters at all? – ttnphns Apr 22 '14 at 05:22
  • I was worried about squared vs. linear, but I read that both `pam` and `hclust` take as arguments whatever `dist` / `daisy` output (and they output linear distances, e.g. `dist(data.frame(x = c(1, 0), y = c(0, 1)))` outputs $\sqrt{2}$; see the quick check below). The data has been classified manually for years in more-or-less balanced clusters (or at least without a catch-all cluster), so it should be possible. I realize that the features provided by the automated source might not be enough, though. – Anul Apr 22 '14 at 07:14
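A quick check of the point in that last comment (a sketch; R's `dist()` returns unsquared Euclidean distances, which some procedures, e.g. `hclust(..., method = "ward.D")`, expect to be squared):

```r
d <- dist(data.frame(x = c(1, 0), y = c(0, 1)))
d      # 1.414214 = sqrt(2): the unsquared Euclidean distance
d^2    # 2: the squared distance that e.g. hclust's "ward.D" method expects
```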

1 Answer


Maybe your data just doesn't contain clusters?

Not all data (or data representations!) is clustered. Random numbers, for example, aren't.

Maybe a large part of your data is just one large blob?

This happens, and you may have to live with this...

If 99% of your users are of type A, I would expect the largest cluster to contain 99% of the objects. Are you sure your data is supposed to consist of groups of equal size?

Maybe you have a lot of outliers

Also a very common thing, and one that causes many methods to fail. k-means, for example, is known to be not very robust against outliers... and in single-link clustering, you usually get a lot of singleton objects (clusters with a single member) before you find any really interesting cluster.

In real data, even when you have clusters A, B and C, you may still have 10% or more objects that belong to none of A, B, or C!
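One way to see the singleton effect mentioned above (a sketch, assuming `D` is the full dissimilarity object from the question; `k = 20` is an arbitrary choice):

```r
## Single-link clustering tends to peel off singletons before forming
## anything interesting when many objects sit at the maximal distance.
fit_single <- hclust(D, method = "single")
sizes <- table(cutree(fit_single, k = 20))
sort(sizes, decreasing = TRUE)   # typically one big blob plus many tiny clusters
```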

Your data representation may be distracting / unhelpful

If you have 5500 features, chances are that some of them are useless. If too many are useless, this may just ruin your analysis.

You may need to spend more time preprocessing your data. For example, if you have 5500 attributes corresponding to colors (blue, light blue, navy blue, royal blue, night sky blue, dark blue, ...) you may need to merge similar attributes first. Similarly with text: it is best practice to do e.g. stemming, merging different variants of the same word (go, going, gone, goes, ...) into one feature.
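A sketch of what that merging step could look like on a 0/1 feature matrix; `X_raw`, `feature_map` and the feature names are hypothetical placeholders (for real text, a stemmer such as `SnowballC::wordStem` could produce the mapping):

```r
## Collapse near-duplicate features with a hand-made lookup table
X_raw <- matrix(c(1, 0, 1, 0,
                  0, 1, 0, 1), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("blue", "navy blue", "go", "going")))

feature_map <- c("blue" = "blue", "navy blue" = "blue",
                 "go"   = "go",   "going"     = "go")

group <- feature_map[colnames(X_raw)]

## Sum the 0/1 columns mapped to the same merged feature, then re-binarize
X_merged <- 1 * (t(rowsum(t(X_raw), group = group)) > 0)
X_merged
```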

Preprocessing is usually key to get good results!

Consider other tasks

Maybe you aren't actually looking for clusters but, e.g., association rules. On sparse binary data, association rules often provide more insight than clusters, in particular when the clusters are based on averaging and strict partitions. Association rules may, for example, overlap: a single item can exhibit two common traits (= rules), or none (i.e. noise, a singleton, ...).
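A sketch of what mining association rules could look like with the `arules` package, reusing the hypothetical binary matrix `X_raw` from above; the support/confidence thresholds are purely illustrative:

```r
library(arules)

## Coerce the logical object-by-feature matrix to transactions and mine rules
trans <- as(X_raw == 1, "transactions")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 10))
```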

Has QUIT--Anony-Mousse
  • Thanks. I've updated my question to include a summary of the number of "close neighbors" per object. I'm not sure if this means I have a lot of outliers... but what can I do to protect hclust / k-means / pam (partitioning around medoids) against outliers? On association rules, would the same idea be captured by fuzzy clusters? – Anul Apr 23 '14 at 06:51