1

Possible Duplicate:
Clustering with a distance matrix

I have a set of data which I wish to cluster.

I have computed a distance measure between each pair of data, but I am limited in that I am unable to compute a measure between each data point and an 'arbitrary' point in space. In addition, the distance measures do not necessarily satisfy the triangle inequality.

I would like to set as a clustering parameter something like a 'minimum distance for two points to be in the same cluster'. I can then find all the edges that satisfy this similarity measure, and treat each remaining subgraph as one cluster.

However, this means that if point A is similar to point B, and B is similar to C, and C is similar to D, point A will end up in the same cluster as point D, even if A and D are very different.

Does anyone have any suggestions as to how I could use this clustering method, but prevent this 'daisy-chaining' of pairwise-similar vectors?

Bill Cheatham
  • 255
  • 3
  • 9
  • 3
    what is the difference between your question and this one: [Clustering with a distance matrix](http://stats.stackexchange.com/questions/2717/clustering-with-a-distance-matrix) – mlwida May 24 '12 at 09:19

2 Answers2

2

Which distance-based clustering methods have you tried?

It sounds like you are exactly looking for DBSCAN: https://en.wikipedia.org/wiki/DBSCAN

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
  • Yes, this does look good, thanks. I am still concerned that maybe 'core' points could end up being a long distance apart from each other, but I will have a go. – Bill Cheatham May 24 '12 at 09:07
  • (In addition, the distance-based clustering methods I had tried was the subgraph method I mention. I realise this is now called single-linkage clustering https://en.wikipedia.org/wiki/Single-linkage_clustering) – Bill Cheatham May 24 '12 at 09:11
  • Single-linkage is much more prone to the single link effect. In DBSCAN you could have a long chain of core points; and there probably is no sensible "mean" of a cluster. But if you set minPts high enough, the cluster should be, well, density connected by a lot of points. – Has QUIT--Anony-Mousse May 24 '12 at 13:30
0

Not really an answer, but it might help. First try to get a true metric, satisfying the triangle inequality... it may be not too hard. Then try clustering your data using a metric tree. I haven't researched carefully the subject, but it seems that the most recent member of that family are cover trees...

dsign
  • 387
  • 2
  • 11
  • This seems like it would help, but I'm not sure my data is suitable. In reality the 'distances' I have are 'confidences of matching', between 0 and 1. So it may be difficult to scale them appropriately. – Bill Cheatham May 24 '12 at 09:09