
I am currently studying $k$-means clustering. An optimal $k$-cluster arrangement is defined as follows:

Fix a distance $\Delta$ and $k < n$. Assume $\mathbb{X}$ has been partitioned into $k$ clusters $\mathcal{C}_\nu$ with cluster centroids $\mathbf{\overline{X}}_\nu$, $\nu \le k$.
A $k$-cluster arrangement $\mathcal{P}$ for $\mathbb{X}$ is the collection
$\mathcal{P} = \mathcal{P}(\mathbb{X}, \Delta, k) = \{ \mathcal{C}_\nu : \nu = 1, \dots, k \}$.
Write $W_\mathcal{P}$ for the within-cluster variability of $\mathcal{P}$.
A $k$-cluster arrangement is optimal if $W_\mathcal{P} \le W_{\mathcal{P}^\prime}$ for every $k$-cluster arrangement $\mathcal{P}^\prime$ of $\mathbb{X}$ that uses $\Delta$, where
$$W_{\mathcal{P}} = \sum_{\nu = 1}^k \sum_{\{ \mathbf{X}_i \in \mathcal{C}_\nu \}} \Delta( \mathbf{X}_i, \mathbf{\overline{X}}_\nu)^2.$$
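For concreteness, here is how I understand this quantity would be computed for a given arrangement (a minimal numpy sketch; the data and the partition are made up, and I'm assuming $\Delta$ is the Euclidean distance):

```python
import numpy as np

def within_cluster_variability(X, labels):
    """W_P: sum over clusters of squared Euclidean distances to the cluster centroid."""
    W = 0.0
    for nu in np.unique(labels):
        cluster = X[labels == nu]        # the points X_i in C_nu
        centroid = cluster.mean(axis=0)  # \bar{X}_nu
        W += ((cluster - centroid) ** 2).sum()
    return W

# Made-up data and an arbitrary 2-cluster arrangement
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
labels = np.array([0, 0, 1, 1])
print(within_cluster_variability(X, labels))
```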

Can we always get an optimal arrangement? I'm thinking that, if the data is really "messy", then there might not be any discernible "clusters"; but I'm wondering what more experienced people think.

kjetil b halvorsen
The Pointer
  • Yes, one of the well-known problems with k-means is that it always returns a solution, regardless of whether or not the resulting clusters have any real meaning in the data. Second, the shape of the clusters returned by k-means is always spherical, again regardless of whether or not a multidimensional sphere reflects real behavior. It's always a good idea to try more than one clustering algorithm and cross-validate the results with some form of discriminant analysis using a train-test approach. – Mike Hunter Oct 11 '20 at 14:18
  • "the shape of the clusters returned by k-means is always spherical" - this is not true. If you have three very well separated clusters with similar amounts of within-cluster variation, k-means with $k=3$ will return these even if one or more of them are not spherical. k-means likes including large distances in a cluster even less than deviations from the spherical shape, so if separation is strong enough k-means will reluctantly respect it. – Christian Hennig Oct 11 '20 at 17:31
  • @lewian Please see this CV post https://stats.stackexchange.com/questions/63558/difference-between-standard-and-spherical-k-means-algorithms/63650 – Mike Hunter Oct 12 '20 at 01:14
  • I read it, but what does it have to do with the discussion here? – Christian Hennig Oct 12 '20 at 10:54
  • @lewian Your toy example of 3 naturally occurring clusters is extremely atypical wrt actual, applied data. The OP's point about "messy" data is appropriate for nearly all multidimensional, real-world data, which form a vague, ill-defined cloud of points. K-means tries to carve out condensations (clusters) of information that minimize the objective. This solution may be statistically *optimal*, but the results may or may not conform to real behavior. Hence the importance of cross-validating the results. Implementing a k-means solution requires at least two *a priori* decisions. – Mike Hunter Oct 13 '20 at 19:01
  • First, the # of clusters to search for and, second, seed clusters to act as start values upon which the solution optimizes. The centroids formed by the algorithm represent the averaged center for a cluster within this cloud of points. This centroid has average radii extending around it which represent the soft boundaries of a cluster in the multidimensional space. By definition, the boundaries of these radii define a sphere around the centroid. Outlier points beyond these bounds are assigned to the nearest cluster based on minimum distance. This is true even for your toy example. – Mike Hunter Oct 13 '20 at 19:01
  • @MikeHunter: You said the shape of clusters returned by k-means is *always* spherical. That's a general statement, and it is wrong if there exists a counterexample, which I gave. To what extent this is "typical" doesn't matter. In fact, k-means in general returns a Voronoi tessellation of Euclidean space, which is typically not perfectly spherical (the definition of spherical I use here is that every cluster can be described as the points in a spherical subset of Euclidean space; google Voronoi k-means to see pictures). The rest of what you wrote is correct but not in conflict with what I wrote. – Christian Hennig Oct 14 '20 at 14:24
  • @lewian Your toy example and its minimum-distance assignment don't falsify the statement that k-means clusters are spherical. The radii formed around a cluster centroid have uniform distance. Objects outside the bounds of those radii are assigned to a cluster based on minimum distance. – Mike Hunter Oct 14 '20 at 15:59
  • Yeah, I made the mistake of changing my definition of "spherical" between my first posting and the previous one. In the first posting I had in mind clusters whose actual data points are spherically distributed. Anyway, counterexamples can be found for both definitions. Have you looked at some of the pictures you find when you google Voronoi k-means? Points between two means that are not too far from each other are assigned to the mean at smaller distance, meaning that the "radius" of a cluster is smaller in the direction of another cluster's mean than where no other mean is found. – Christian Hennig Oct 14 '20 at 21:31
  • @lewian Thanks for that suggestion. Voronoi k-means is new to me and looks quite interesting. – Mike Hunter Oct 15 '20 at 11:03
  • Voronoi k-means is not a method; rather, you find links and pictures that explore the connection between k-means and so-called Voronoi tessellations. – Christian Hennig Oct 15 '20 at 13:14

1 Answer

  1. Note that in a proper definition of $k$-means the distance $\Delta$ has to be the Euclidean distance, despite the fact that some literature these days defines it using any distance. The reason is that only for the Euclidean distance (or equivalent distances) are the means actually the optimal centroids. You can try to solve the optimisation problem with other distances, but then the centroids would need to be defined differently, and the term $k$-means would no longer be justified. You can also solve the optimisation problem assuming that the centroids are means (which is what your notation suggests), but this will give you an overall suboptimal solution in the non-Euclidean case that could be improved by choosing better centroids, and is therefore not a good method.

  2. Regarding your question, there are three different issues here.

(2a) As there are only finitely many clusterings, one could in principle run through all of them and find the optimum of the objective function $W_P$, which means that an optimal arrangement always exists, although it may not be unique (two different clusterings may end up with exactly the same value of $W_P$; with continuous data, however, this will hardly ever happen).
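To make (2a) concrete, here is a brute-force sketch (the helper function and the toy data are made up for illustration) that enumerates every assignment of a tiny dataset to $k$ labels and keeps the arrangement with the smallest $W_P$; this is only feasible for very small $n$:

```python
import itertools
import numpy as np

def W(X, labels):
    # within-cluster sum of squared Euclidean distances to the cluster means
    return sum(((X[labels == nu] - X[labels == nu].mean(axis=0)) ** 2).sum()
               for nu in np.unique(labels))

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [5.0, 5.0], [5.1, 4.9]])
k, n = 2, len(X)

best_W, best_labels = np.inf, None
for assignment in itertools.product(range(k), repeat=n):
    labels = np.array(assignment)
    if len(np.unique(labels)) < k:   # skip assignments that leave a cluster empty
        continue
    w = W(X, labels)
    if w < best_W:
        best_W, best_labels = w, labels

print(best_labels, best_W)   # the globally optimal 2-cluster arrangement
```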

(2b) In practice, finding the optimal solution of (2a) may be computationally very hard, so normally (unless the dataset is very small) algorithms are used that are only guaranteed to find a local optimum, which is not necessarily the global one.
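To illustrate (2b): a Lloyd-type algorithm run from a single random initialisation may end in different local optima for different starts, and running many random restarts while keeping the solution with the smallest $W_P$ is the common workaround. A sketch using `sklearn.cluster.KMeans` on made-up data (its `inertia_` attribute is the value of $W_P$ for the returned labels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data: several blobs in 2D
X, _ = make_blobs(n_samples=300, centers=6, cluster_std=1.5, random_state=0)

# One random initialisation per run: different seeds may reach different local optima
inertias = []
for seed in range(20):
    km = KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

print(sorted(set(np.round(inertias, 2))))  # often several distinct local optima
```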

(2c) The fact that the optimum of $W_P$ is mathematically guaranteed to exist on a finite dataset does not mean that the resulting clustering is "good" in any other respect. The optimisation of $W_P$ defines what, according to $k$-means, a good clustering is, and according to this definition there is always a "best" clustering. However, you may be interested in other aspects of clustering, such as the separation of clusters from each other, that are not directly taken into account in the definition of $W_P$, and in this respect the "best" $k$-means clustering may indeed not be good. This applies not only to very "messy" datasets, but also to datasets that are intuitively nicely clustered but where clusters have, for example, strongly different within-cluster variation or nonlinear shapes. In such situations $k$-means would not be an appropriate clustering method (at least if you want to find the nice clusters rather than the optimum of $W_P$).
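As a small illustration of (2c), assuming two long, thin, well-separated "stripes" as the intuitive clusters: the arrangement that minimises $W_P$ typically cuts across both stripes rather than recovering them, so the $W_P$-optimal clustering is not the intuitively good one. A sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two long, thin, well-separated "stripes"; the intuitive clusters are bottom vs. top
n = 200
x = rng.uniform(0, 10, size=2 * n)
y = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(2.0, 0.1, n)])
X = np.column_stack([x, y])
intuitive = np.repeat([0, 1], n)

def W(X, labels):
    return sum(((X[labels == nu] - X[labels == nu].mean(axis=0)) ** 2).sum()
               for nu in np.unique(labels))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("W_P of the intuitive stripes:", round(W(X, intuitive), 1))
print("W_P of the k-means solution :", round(W(X, km.labels_), 1))
# The k-means value is typically smaller even though its clusters cut across both stripes.
```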

Christian Hennig
  • I'm not sure that I'm understanding what you're saying. So the optimum $W_\mathcal{P}$ *is* guaranteed to exist? But I think what I'm really asking is: **in practice**, is it always possible to get an optimal $k$-cluster arrangement? This wasn't clear to me from reading your answer. – The Pointer Oct 12 '20 at 08:41
  • As I said, the optimum is guaranteed to exist, but it may not be unique. Regarding computation, existing algorithms, except for very small data, will only deliver local optima. If you initialise them randomly often enough, you have a good chance of finding the global optimum in simple situations (not too many observations or dimensions), although this cannot be guaranteed and you won't know it for sure. In bigger datasets I'd think that often the global optimum is not found. I can't be more precise because this depends on all kinds of characteristics of the data and the specific algorithm. – Christian Hennig Oct 12 '20 at 10:51
  • Ok, I understand. Thanks for the clarification. Can you please clarify a bit what you mean by "local optima" in the context of $k$-means? I understand what local/global optima mean in general, but I'm not totally sure what they mean in the specific case of $k$-means. – The Pointer Oct 12 '20 at 11:15
  • By "local optimum" here I mean a solution that cannot be improved by further steps of the same algorithm. What exactly that means depends on the specific algorithm. (For discrete optimisation problems like this one, the concept of a "local optimum" is not quite as nice and clear as for continuous functions in continuous space.) – Christian Hennig Oct 12 '20 at 11:32
  • I've read back over this answer, and I've summarised it in my own words as follows: [...] – The Pointer Nov 01 '20 at 15:14
  • [...] Since there are a finite number of possible clusterings, it is theoretically possible to find a globally optimal solution. However, in practice, unless the dataset is small (and also low-dimensional), finding the global optimum is computationally intensive, and so the algorithms used in practice only guarantee finding a local optimum. Even so, in the case of small, low-dimensional datasets, if you run the algorithms repeatedly with random initialisations, then you have a decent chance of finding the global optimum. – The Pointer Nov 01 '20 at 15:14
  • Am I understanding this correctly? – The Pointer Nov 01 '20 at 15:15
  • Yep, seems correct to me. – Christian Hennig Nov 01 '20 at 16:29