2

I have the following problem: I have a set of one-dimensional numbers (e.g. [1, 5, 7, 8, 15, 29, 100]). I need to cluster them OPTIMALLY (globally, not a local optimum as in Lloyd's algorithm) and in polynomial rather than exponential time, minimizing the k-means objective (finding centroids such that the sum of squared distances from the centroids to the points of their clusters is minimized).

So far I have found only a paper on a dynamic programming solution for k-means, but it seems too difficult. Can someone suggest a possible way to do this?

SWIM S.
    What does "optimal" mean? That is, what is your objective function? Do you wish to specify the number of clusters or not? Perhaps you are looking for [Jenks' method](http://stats.stackexchange.com/search?q=Jenks)? – whuber Nov 01 '16 at 22:39
  • Yes, the number of clusters is given; let it be $k$. The minimization is: $\sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2$, i.e. squared differences to the cluster centroid are computed for all points in each cluster and summed over all clusters. I guess this is the standard k-means minimization problem, as on Wikipedia: https://en.wikipedia.org/wiki/K-means_clustering#Description – SWIM S. Nov 01 '16 at 22:45
  • Thanks for the suggestion. I think I already ran across Jenks' method while looking for answers, but it seems this should be possible to do deterministically without statistical methods. Or I don't quite understand what Jenks' method is. Does it guarantee optimality? And is it difficult to implement? This one should surely have some simple answer, which I cannot find =) or not =) – SWIM S. Nov 01 '16 at 22:50
  • Check: http://stats.stackexchange.com/questions/163778/how-do-you-find-a-cutting-point-strong-slope-within-one-dimensional-data/163787#163787 – Tim Nov 02 '16 at 08:06

2 Answers

4

The R package Ckmeans.1d.dp provides a dynamic programming approach to find the optimal solution to the k-means objective in one dimension. It's discussed in:

Wang, H. and Song, M. (2011). Ckmeans.1d.dp: Optimal $k$-means Clustering in One Dimension by Dynamic Programming. The R Journal, 3(2).

This may be what you meant by "I have found only a paper on a dynamic programming solution for k-means, but it seems too difficult," but you may not have realized that there's simply an R package you can call, which saves you from implementing the algorithm yourself. It's really not that bad to implement anyway, and I doubt there's going to be a simpler optimal algorithm in polynomial time.
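A minimal usage sketch in R (assuming a current CRAN version of the package; the data vector and the choice $k = 3$ are just illustrative, taken from the numbers in the question):

```r
# Optimal 1-D k-means via the dynamic programming package discussed above.
# Assumes install.packages("Ckmeans.1d.dp") has already been run.
library(Ckmeans.1d.dp)

x <- c(1, 5, 7, 8, 15, 29, 100)  # the numbers from the question
k <- 3                           # desired number of clusters (illustrative)

result <- Ckmeans.1d.dp(x, k)

result$cluster   # cluster index assigned to each point
result$centers   # cluster means (the centroids)
result$withinss  # within-cluster sums of squares
```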

Danica
  • Yes, this is exactly what I found. But I can't use the algorithm per se; I need to understand how it works in order to prove it (well, not prove it myself, but I need to explain it in my own words). So I thought there could be something simpler. As for this one, I cannot understand the initialization of the dynamic programming matrix and how the cells are updated with the recurrence formula. – SWIM S. Nov 01 '16 at 23:26
3

1-dimensional data is a lot easier, and the problem is not NP-hard in 1 dimension.

Try the following approach:

  1. Sort the data!

  2. Rather than assigning points to clusters, you partition the sorted data into $k$ non-empty contiguous intervals. So for $k=2$, you would put one point into the first interval and all others into the second; then two points into the first and all others into the second; and so on. There are several shortcuts you can implement for this; I'm not going to detail them here.

The cost of this is $O(n \log n)$ for sorting (with tiny constants) and $O(n^{k-1})$ (without improvements) for enumerating all possible partitions. Thus, for small $k$, it is not exponential. (For large $k$, the bound may actually be $O((n-k)^{k-1})$ by disallowing empty partitions.)

There are some fairly easy strategies to avoid having to check all such partitions, but exploiting the sorted order is the easiest way to avoid the $O(k^n)$ cost of enumerating arbitrary assignments.
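To make the enumeration concrete, here is a brute-force sketch in R of exactly steps 1–2 above, without any of the shortcuts mentioned (the helper name `best_1d_kmeans` and the example call with $k = 3$ are mine, just for illustration):

```r
# Sort the data, enumerate every way to cut it into k non-empty contiguous
# intervals, and keep the partition with the smallest total within-cluster
# sum of squares. Cost: O(n log n) for sorting + O(n^(k-1)) partitions.
best_1d_kmeans <- function(x, k) {
  x <- sort(x)
  n <- length(x)
  sse <- function(v) sum((v - mean(v))^2)  # within-interval sum of squares

  best_cost <- Inf
  best_cuts <- NULL
  cut_sets <- combn(n - 1, k - 1)          # k-1 cut positions among n-1 gaps
  for (j in seq_len(ncol(cut_sets))) {
    cuts <- cut_sets[, j]
    bounds <- c(0, cuts, n)                # interval boundaries in the sorted data
    cost <- sum(sapply(seq_len(k), function(i)
      sse(x[(bounds[i] + 1):bounds[i + 1]])))
    if (cost < best_cost) {
      best_cost <- cost
      best_cuts <- cuts
    }
  }
  list(sorted_x = x, cuts = best_cuts, cost = best_cost)
}

best_1d_kmeans(c(1, 5, 7, 8, 15, 29, 100), k = 3)
```

The dynamic programming package mentioned in the other answer reaches the same optimum without enumerating all partitions.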

hh32
Has QUIT--Anony-Mousse