
I wish to perform a spatial clustering of scattered data that represents geographic locations of individuals in an urban area. Hierarchical clustering seems to work well, and I have successfully done this for the 100,000 data points in the set.

However, as an additional constraint/objective, we would like the clusters to have equal counts. That means the clusters in the city center would be spatially small, and those in the suburban areas would be large.

Is there a ready-made solution for this problem?

Eric Brown
    Different methods have a stronger or weaker tendency to get approximately equal count, but I don't know of one that has it as a constraint. – Peter Flom Oct 19 '12 at 21:52
  • I wonder _which_ hierarchical method you used with as many as 100,000 points. Many of the methods are prone to suboptimal solutions in later steps as the number of steps grows. – ttnphns Oct 20 '12 at 10:38
    [Here](http://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points) is a similar question. Among the suggestions there you might notice my comment on using the silhouette index to improve a clustering that has already been computed, in particular to make the clusters more equal in their counts. – ttnphns Oct 20 '12 at 10:53
    You can use a variant of the (`R` based) [quadtree solution](http://gis.stackexchange.com/a/31879) I posted on the GIS site last August. – whuber Nov 18 '12 at 23:05
  • If you also require the clustered units to be adjacent, then a brute-force search may be feasible. For example, if area A is only next to areas B and C (e.g., a top-left corner) and your cluster size is 3 or more, then A, B, and C must be in the same cluster. So you could possibly cluster "from the borders to the centre" of your map. – probabilityislogic Aug 18 '13 at 04:05

1 Answer


This is not really a clustering task anymore. Clustering is about discovering structure, but here you are imposing a formal constraint that is stronger than any structure present in the data.

However, you can look at it from a different angle:

A structure that tries really hard to divide the data into non-overlapping, equally sized "bins" is a balanced tree. So you probably want a spatial tree such as the R*-tree. It will try to split your data into partitions of equal size.

Now, for 2D point data in particular, it is much easier to bulk-load the tree using STR (Sort-Tile-Recursive) than to build a full R-tree incrementally. STR will actually give you non-overlapping partitions that really are of equal size, whereas the regular R- and R*-trees only achieve this within certain thresholds.

The general process you need is very simple (and this is a consequence of your "equal size" constraint!): compute sqrt(k), where k is your desired number of partitions. Sort your data by the x-axis and divide it into sqrt(k) equally sized slices. Within each of these slices, sort the points by the y-axis and divide them again into sqrt(k) slices. You then have sqrt(k)^2 = k partitions that have the same size and do not overlap. Plus, it takes just O(n log n) time, which is much faster than hierarchical clustering.
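The slicing described above can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the answer's exact implementation: the function name `str_partition` is my own, and for simplicity it assumes 2D points and that k is a perfect square (so sqrt(k) slices per axis come out as an integer).

```python
import numpy as np

def str_partition(points, k):
    """One round of Sort-Tile-Recursive (STR) slicing for 2D points.

    points: (n, 2) array of coordinates.
    k: desired number of partitions; assumed a perfect square here.
    Returns an array of n partition labels in [0, k).
    """
    s = int(round(np.sqrt(k)))
    assert s * s == k, "this sketch assumes k is a perfect square"

    labels = np.empty(len(points), dtype=int)
    # Sort by x and cut into s vertical slabs of (nearly) equal count.
    order_x = np.argsort(points[:, 0])
    for i, slab in enumerate(np.array_split(order_x, s)):
        # Within each slab, sort by y and cut into s equal-count tiles.
        order_y = slab[np.argsort(points[slab, 1])]
        for j, tile in enumerate(np.array_split(order_y, s)):
            labels[tile] = i * s + j
    return labels
```

Because `np.array_split` balances the slice sizes at each level, the resulting tile counts differ by at most one point when n divides evenly across the slabs, matching the "equal count" requirement of the question.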

Has QUIT--Anony-Mousse
  • I think this proposal might have missed the point: although there are strong constraints, the quest for *clusters* (all of which should have minimal spatial extent) suggests this coordinate-slicing method will produce less-than-desirable results. As an example of what is conceivable, we could ask for a K-means solution subject to the constraint that all clusters have (nearly) identical cardinalities: this will reveal spatial structuring. (Unfortunately, I know of no good algorithm for this K-means version of the problem.) – whuber Nov 18 '12 at 23:12
  • The R*-Tree or a STR bulk loaded tree will have quite spherical clusters. R*-tree tries to minimize the perimeter of the MBRs, STR bulk load tries to divide all dimensions evenly. R*-tree will also allow some overlap of the boxes: it will still try to assign the points to the best match. If you choose the leaf size to match the desired number of clusters, it will produce a surprisingly good result. K-means clusters actually have much more odd shapes, as they represent Voronoi cells (which already at 3d are somewhat beyond our capabilities of grasping, while we do understand 3d boxes okay) – Has QUIT--Anony-Mousse Nov 19 '12 at 07:58
  • Quite often, k-means results that are mathematically "optimal" appear quite nonsensical to users, IMHO. And in particular, k-means can't really cope with clusters of different spatial size, as it always puts the split line exactly in the middle of the two clusters. This limits (see the question) the ability of k-means to adapt to dense city areas and much less dense rural areas. – Has QUIT--Anony-Mousse Nov 19 '12 at 08:02
  • Actually, k-means works decently--provided you can meet the constraints. I tested this out in a prototype about ten years ago using simulated annealing to find the solution. But I'm not really writing to defend k-means; rather, to point out that contrary to your first sentence, this *is* a clustering problem and would benefit from being analyzed as such. – whuber Nov 19 '12 at 15:18
  • I don't think it is a structure discovery (= clustering) problem, but a data partitioning (which is *closely* related to indexing) problem. That k-means is more of a partitioning algorithm than a structure discovery algorithm is a different matter, but adds to the confusion. – Has QUIT--Anony-Mousse Nov 19 '12 at 15:38
  • I don't think the OP is confused! Take a look at the example I gave in a comment to the question: although you are correct that the constraints (in this case, a little more relaxed than requested, but strong constraints nonetheless) partition the data--and this was indeed the purpose in that example--they still reflect spatial clustering. Compact clusters occur where points are dense and more extensive clusters occur where points are not dense. – whuber Nov 19 '12 at 16:17
  • What you are describing is density-based data partitioning, as a QuadTree or an R-tree would do. Density-based clustering would be the task of discovering the cities (making each one a single cluster!) within a lower-density area. Clustering is *not* supposed to arbitrarily break the city center into multiple parts. – Has QUIT--Anony-Mousse Nov 19 '12 at 16:28
  • Fair enough: it sounds like you are using "clustering" in a different sense than intended by the OP (and as I usually understand it). It's fine to make such a distinction--I appreciate that very much and have learned from it--but it looks unfair to me to suggest the OP is confused about what they are asking and, then, in effect, to propose a solution that does not really achieve what the OP is looking for. – whuber Nov 19 '12 at 16:31
  • I'm not calling the OP confused. I'm calling the textbooks confusing here, about what clustering is. It's being used as a fancy word for anything that partitions data, as if "data partitioning" was an insult. – Has QUIT--Anony-Mousse Nov 19 '12 at 17:12