I have a series of coordinates. I know that I can cluster these using some of the basic methods like k-means or hierarchical clustering. I can also easily find out which is the closest neighbor to each coordinate.

However, how do I split the data into clusters so that each cluster contains exactly n points and each coordinate belongs to exactly one cluster?

How could I do this in R, for example on this data and Euclidean distances:

data(iris)
plot(iris$Sepal.Length, iris$Sepal.Width)
Figaro
  • Requiring exactly equal `n` is a somewhat unusual, artificial demand to place on a cluster analysis. Different answers could follow depending on why you need it and on how much the equal-`n` constraint is allowed to interfere with the procedure's optimization function. But you don't say a word about that. – ttnphns Dec 30 '16 at 14:59
  • The most immediate, blunt recommendation, given how broad your question is, might be to run any cluster analysis you like (e.g. k-means) to obtain k clusters. Then run through the cases and reassign "close outsiders" so that the clusters move towards equal `n`. After each reassignment you might want to recompute the centroids to update the list of candidates for reassignment. – ttnphns Dec 30 '16 at 14:59
  • You are right, my task is a bit arbitrary. It's not the usual way to cluster based on distance, but it needs to be done this way. The solution doesn't have to be optimal with respect to some distance measure, just close enough that nearby coordinates are grouped together. – Figaro Dec 31 '16 at 23:19

1 Answer

There exist variations of e.g. k-means that produce clusters of the same size. See e.g.

The first link is probably the easiest to use. Get the Java software and run the algorithm.

Beware: fixing the cluster sizes contradicts the usual clustering objectives of "discovering structure" because the data will usually not consist of clusters of the same size. The objective of clustering is to discover the size, number, location, and shape of such structures in your data. If you know that your data has k clusters of n points each, it may be better to treat this as an optimization rather than a clustering problem.
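If you would rather stay in R, the optimization view can be sketched in base R. This is only a greedy heuristic of my own choosing, not the algorithm from the software mentioned above: it takes ordinary k-means centroids and then fills each cluster up to a capacity of n, closest points first. The function name `equal_size_clusters` and its `seed` argument are made up for illustration, and the sketch assumes the number of points is an exact multiple of n.

```r
# Heuristic sketch (hypothetical helper, not a library function):
# k-means centroids followed by a capacity-constrained greedy assignment.
equal_size_clusters <- function(x, n, seed = 1) {
  x <- as.matrix(x)
  stopifnot(nrow(x) %% n == 0)          # assumes the points divide evenly
  k <- nrow(x) %/% n                    # number of clusters of size n
  set.seed(seed)
  centers <- kmeans(x, centers = k, nstart = 10)$centers

  # Squared Euclidean distance from every point (rows) to every centroid (cols)
  d <- matrix(
    sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2)),
    ncol = k
  )

  # Visit all (point, cluster) pairs from the closest pair to the farthest
  pairs <- arrayInd(order(d), dim(d))
  label <- integer(nrow(x))             # 0 means "not yet assigned"
  size  <- integer(k)                   # current fill level of each cluster
  for (r in seq_len(nrow(pairs))) {
    i <- pairs[r, 1]
    j <- pairs[r, 2]
    if (label[i] == 0L && size[j] < n) {
      label[i] <- j
      size[j]  <- size[j] + 1L
    }
  }
  label
}

data(iris)
xy <- iris[, c("Sepal.Length", "Sepal.Width")]
cl <- equal_size_clusters(xy, n = 30)   # 150 points -> 5 clusters of 30
table(cl)
plot(xy, col = cl, pch = 19)
```

Because the assignment is greedy, points visited late can end up in a cluster that is not their nearest one. If that matters, a swap-based refinement or an exact assignment solver would be the next step.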

Has QUIT--Anony-Mousse