Clustering -- Intuition behind Kleinberg's Impossibility Theorem

Question

I've been thinking about writing a blog post on this interesting analysis by Kleinberg (2002) that explores the difficulty of clustering. Kleinberg outlines three seemingly intuitive desiderata for a clustering function and then proves that no function satisfies all of them. There are many clustering algorithms that satisfy two of the three criteria; however, no function can satisfy all three simultaneously.

Briefly and informally, the three desiderata he outlines are:

Scale-Invariance: If we transform the data so that everything is stretched out equally in all directions, then the clustering result shouldn’t change.
Consistency: If we stretch the data so that the distances between clusters increase and/or the distances within clusters decrease, then the clustering result shouldn’t change.
Richness: The clustering function should theoretically be able to produce any arbitrary partition/clustering of datapoints (in the absence of knowing the pairwise distance between any two points)
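
To make the first property concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy single-linkage rule with a fixed distance threshold `r` (the names `threshold_clustering`, `relative_clustering`, `line_metric` and `scale` are made up for illustration). The fixed threshold changes its answer when all distances are stretched, whereas a threshold chosen as a fraction of the largest distance does not.

```python
from itertools import combinations


def threshold_clustering(d, r):
    """Link every pair at distance <= r and return the connected components."""
    n = len(d)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if d[i][j] <= r:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return {frozenset(g) for g in groups.values()}


def relative_clustering(d, alpha):
    """Same rule, but the threshold is a fraction alpha of the largest distance."""
    return threshold_clustering(d, alpha * max(max(row) for row in d))


def line_metric(xs):
    """Distance matrix of points on a line (toy data, chosen for illustration)."""
    return [[abs(a - b) for b in xs] for a in xs]


def scale(d, factor):
    return [[factor * v for v in row] for row in d]


d = line_metric([0, 1, 10, 11])
print(threshold_clustering(d, r=2))            # two clusters: {0, 1} and {2, 3}
print(threshold_clustering(scale(d, 5), r=2))  # four singletons
print(relative_clustering(d, 0.2) == relative_clustering(scale(d, 5), 0.2))  # True
```

The relative rule buys scale-invariance by giving up the absolute notion of "close", which is where the tension with consistency comes from.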

Questions:

(1) Is there a good intuition, or a geometric picture, that shows the inconsistency between these three criteria?

(2) This refers to technical details of the paper. You will have to read the paper linked above to understand this portion of the question.

In the paper, the proof to theorem 3.1 is a bit hard for me to follow at points. I'm stuck at: "Let $f$ be a clustering function that satisfies Consistency. We claim that for any partition $\Gamma \in \text{Range}(f)$, there exist positive real numbers $a < b$ such that the pair $(a, b)$ is $\Gamma$-forcing."

I don't see how this can be the case... Isn't the partition below a counter-example where $a > b$ (i.e. minimum distance between clusters is greater than maximum distance within clusters)?

Edit: this is clearly not a counterexample, I was confusing myself (see answers).

Other papers:

Ackerman & Ben-David (2009). Measures of Clustering Quality: A Working Set of Axioms for Clustering
- Points out some issues with the "consistency" axiom

In regard to "consistency": this characteristic is intuitively desired only when the clusters are already well-separated. When they are not, there is an issue on the number of clusters in the data - for the analysis, since it is unsupervised, it is a question. Then it is quite normal to expect that as you gradually add distance between the clusters (as they were generated by you) the analysis changes the assignments it does during the clustering process. — ttnphns, Sep 20 '15 at 08:59
In regard to "richness": I'm sorry I didn't understand what it means (at least as you've put it). Clustering algorithms are many, how can you expect that they all obey some particular fancy requirement? — ttnphns, Sep 20 '15 at 09:02
In respect to your picture: special clustering methods are needed to recognize such a pattern. Traditional/original clustering methods stem from biology and sociology, where clusters are more or less spheroid dense "islands", not atoll rings. These methods cannot be expected to cope with the data in the picture. — ttnphns, Sep 20 '15 at 09:11
You may also be interested in: Estivill-Castro, Vladimir. "Why so many clustering algorithms: a position paper." ACM SIGKDD explorations newsletter 4.1 (2002): 65-75. — Has QUIT--Anony-Mousse, Sep 20 '15 at 17:42
I haven't read the paper. But in many clustering algorithms you have some distance threshold (e.g. DBSCAN, hierarchical clustering). If you scale the distances, of course you also need to scale your threshold accordingly. Thus, I disagree with his scale-invariance requirement. I also disagree with richness. Not every partition must be a valid solution for every algorithm. There are millions of random partitions. — Has QUIT--Anony-Mousse, Sep 20 '15 at 17:47
@ttphns -- yes you need something like single-linkage to produce the example I gave. But this can satisfy consistency under appropriate stopping conditions. — ahwillia, Sep 20 '15 at 17:52
IMHO, he is A) conflating distances and clustering. You may as well read his article as "there is no perfect distance function". and B) overshooting the target with his requirements. In my opinion, the essay by Estivill-Castro is much more insightful: the requirements of people vary so much, that there cannot be a one-size-fits-all clustering algorithm. — Has QUIT--Anony-Mousse, Sep 20 '15 at 17:54
Richness is useless IMHO. The only examples of clusterings that violate richness are those that enforce exactly k clusters, such as k-means. Even thresholded single-link is rich: distance=0 if they are to be in the same cluster and infty otherwise. Richness is boring, because you can encode the desired output in the distance function. So "richness = must not enforce k clusters". Not very useful, somewhat just to rule out k-means as good solution? — Has QUIT--Anony-Mousse, Sep 20 '15 at 18:46
Regarding question #2, it seems you have `a` and `b` swapped. "for all pairs of points i, j that belong to the same cluster of Γ, we have d(i, j) ≤ a" suggests `a` is the maximum distance within clusters. — xan, Oct 04 '15 at 18:44
"Isn't the partition below a counter-example where a>b (minimum distance between clusters is greater than maximum distance within clusters)" I do not think the image you provide is an example of what you quote. Note that the maximum distance within clusters is probably the distance between 9 o'clock and 3 o'clock red, whereas the minimum distance between clusters is roughly 12 o'clock black to 12 o'clock black; the latter is clearly *not* greater than the former. — Alexis, Jun 23 '18 at 05:36

Communicative Algebra · Answer 1 · 2018-08-21T15:48:23.220

One way or another, every clustering algorithm relies on some notion of “proximity” of points. It seems intuitively clear that you can either use a relative (scale-invariant) notion or an absolute (consistent) notion of proximity, but not both.

I will first try to illustrate this with an example, and then go on to say how this intuition fits with Kleinberg’s Theorem.

An illustrative example

Suppose we have two sets $S_1$ and $S_2$ of $270$ points each, arranged in the plane like this:

[Figure not shown: the point sets $S_1$ and $S_2$]

You might not see $270$ points in either of these pictures, but that’s just because many of the points are very close together. We see more points when we zoom in:

[Figure not shown: zoomed-in views of $S_1$ and $S_2$]

You’ll probably spontaneously agree that, in both data sets, the points are arranged in three clusters. However, it turns out that if you zoom in on any of the three clusters of $S_2$, you see the following:

[Figure not shown: zoomed-in view of one cluster of $S_2$]

If you believe in an absolute notion of proximity, or in consistency, you’ll still maintain that, irrespective of what you just saw under the microscope, $S_2$ consists of just three clusters. Indeed, the only difference between $S_1$ and $S_2$ is that, within each cluster, some points are now closer together. If, on the other hand, you believe in a relative notion of proximity, or in scale invariance, you’ll feel inclined to argue that $S_2$ consists not of $3$ but of $3×3 = 9$ clusters. Neither of these points of view is wrong, but you do have to make a choice one way or the other.
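
The two points of view can be mimicked with a small sketch. This is a one-dimensional toy stand-in and not the data behind the pictures: the coordinates, the absolute threshold of `150`, and the "relative" rule (a threshold of `1.5` times the median nearest-neighbour distance) are all assumptions made for illustration.

```python
from statistics import median


def gaps(xs):
    xs = sorted(xs)
    return [b - a for a, b in zip(xs, xs[1:])]


def count_clusters(xs, r):
    """On a line, linking points at distance <= r splits the data at every gap > r."""
    return 1 + sum(1 for g in gaps(xs) if g > r)


def nn_distances(xs):
    """Distance from each point (in sorted order) to its nearest neighbour."""
    g = gaps(xs)
    inf = float("inf")
    return [min(left, right) for left, right in zip([inf] + g, g + [inf])]


def absolute_rule(xs, r=150):
    """Absolute proximity: a fixed threshold, whatever the data look like."""
    return count_clusters(xs, r)


def relative_rule(xs, factor=1.5):
    """Relative proximity: the threshold follows the typical neighbour spacing."""
    return count_clusters(xs, factor * median(nn_distances(xs)))


centres = [0, 10_000, 20_000]
# S1-like data: three clusters of nine evenly spaced points.
s1 = [c + 100 * k for c in centres for k in range(9)]
# S2-like data: each cluster hides three tight triples.
tight = [0, 1, 2, 102, 103, 104, 204, 205, 206]
s2 = [c + t for c in centres for t in tight]

print(absolute_rule(s1), relative_rule(s1))  # 3 3
print(absolute_rule(s2), relative_rule(s2))  # 3 9
print(absolute_rule([10 * x for x in s2]))   # 9: the absolute rule is not scale invariant
```

The absolute rule keeps seeing three clusters in the `s2` data but changes its mind when the same data are merely rescaled; the relative rule is unaffected by rescaling but reports nine clusters.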

A case for isometry invariance

If you compare the above intuition with Kleinberg’s Theorem, you will find that they are slightly at odds. Indeed, Kleinberg’s Theorem appears to say that you can achieve scale invariance and consistency simultaneously as long as you do not care about a third property called richness. However, richness is not the only property you lose if you simultaneously insist on scale invariance and consistency. You also lose another, more fundamental property: isometry-invariance. This is a property that I wouldn’t be willing to sacrifice. As it doesn’t appear in Kleinberg’s paper, I’ll dwell on it for a moment.

In short, a clustering algorithm is isometry invariant if its output depends only on the distances between points, and not on some additional information like labels that you attach to your points, or on an ordering that you impose on your points. I hope this sounds like a very mild and very natural condition. All algorithms discussed in Kleinberg’s paper are isometry invariant, except for the single linkage algorithm with the $k$-cluster stopping condition. According to Kleinberg’s description, this algorithm uses a lexicographical ordering of the points, so its output may indeed depend on how you label them. For instance, for a set of three equidistant points, the output of the single linkage algorithm with the $2$-cluster stopping condition will give different answers according to whether you label your three points as “cat”, “dog”, “mouse” (c < d < m) or as “Tom”, “Spike”, “Jerry” (J < S < T):

[Figure not shown: the outputs for the two labelings]

This unnatural behaviour can of course easily be repaired by replacing the $k$-cluster stopping condition with a “$(≤ k)$-cluster stopping condition”. The idea is simply not to break ties between equidistant points, and to stop merging clusters as soon as we have reached at most $k$ clusters. This repaired algorithm will still produce $k$ clusters most of the time, and it will be isometry invariant and scale invariant. In agreement with the intuition given above, it will however no longer be consistent.
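
Here is a minimal sketch of both variants on the three equidistant points. It assumes that ties are broken by sorting the label pairs lexicographically, which is only one reading of the tie-breaking rule; the function names are hypothetical.

```python
from itertools import combinations, groupby


def _merge(clusters, a, b):
    """Merge the clusters containing a and b (no-op if they already coincide)."""
    ca = next(c for c in clusters if a in c)
    cb = next(c for c in clusters if b in c)
    return clusters if ca == cb else (clusters - {ca, cb}) | {ca | cb}


def k_cluster(labels, dist, k):
    """Single linkage, stop at exactly k clusters; ties broken by label order."""
    clusters = {frozenset([x]) for x in labels}
    edges = sorted(combinations(sorted(labels), 2), key=lambda e: (dist(*e), e))
    for a, b in edges:
        if len(clusters) <= k:
            break
        clusters = _merge(clusters, a, b)
    return clusters


def at_most_k_cluster(labels, dist, k):
    """Repaired variant: add all tied edges together, stop at <= k clusters."""
    clusters = {frozenset([x]) for x in labels}
    edges = sorted(combinations(labels, 2), key=lambda e: dist(*e))
    for _, tied in groupby(edges, key=lambda e: dist(*e)):
        if len(clusters) <= k:
            break
        for a, b in tied:
            clusters = _merge(clusters, a, b)
    return clusters


def positions(clusters, labels):
    """Translate labels back to point positions so the outputs can be compared."""
    return {frozenset(labels.index(x) for x in c) for c in clusters}


unit = lambda a, b: 1                 # three equidistant points
pets = ["cat", "dog", "mouse"]        # the same three points ...
toons = ["Tom", "Spike", "Jerry"]     # ... under different names

print(positions(k_cluster(pets, unit, 2), pets))    # points 0 and 1 are merged
print(positions(k_cluster(toons, unit, 2), toons))  # points 1 and 2 are merged
print(positions(at_most_k_cluster(pets, unit, 2), pets)
      == positions(at_most_k_cluster(toons, unit, 2), toons))  # True
```

On this input the repaired variant returns a single cluster under either naming, since all three tied edges are added at once.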

For a precise definition of isometry invariance, recall that Kleinberg defines a clustering algorithm on a finite set $S$ as a map that assigns to each metric on $S$ a partition of $S$: $$ \begin{aligned} Γ\colon \{\text{metrics on } S\} &→ \{\text{partitions of } S\}\\ d &↦ Γ(d) \end{aligned} $$ An isometry $i$ between two metrics $d$ and $d'$ on $S$ is a permutation $i\colon S → S$ such that $d'(i(x),i(y)) = d(x,y)$ for all points $x$ and $y$ in $S$.

Definition: A clustering algorithm $Γ$ is isometry invariant if it satisfies the following condition: for any metrics $d$ and $d'$, and any isometry $i$ between them, the points $i(x)$ and $i(y)$ lie in the same cluster of $Γ(d')$ if and only if the original points $x$ and $y$ lie in the same cluster of $Γ(d)$.
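
For a small set $S$ the definition can be tested by brute force over all permutations. The sketch below is only illustrative: it checks the condition for one given metric (so a `True` result is necessary but not sufficient for isometry invariance), and the two example algorithms, `threshold_clusters` and `first_point_alone`, are hypothetical.

```python
from itertools import combinations, permutations


def relabel(d, perm):
    """Return d' with d'[perm[x]][perm[y]] = d[x][y], so perm is an isometry d -> d'."""
    n = len(d)
    d_new = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            d_new[perm[x]][perm[y]] = d[x][y]
    return d_new


def invariant_on(gamma, d):
    """Check the definition for every permutation of the points of this one metric."""
    base = gamma(d)
    for perm in permutations(range(len(d))):
        expected = {frozenset(perm[x] for x in c) for c in base}
        if gamma(relabel(d, perm)) != expected:
            return False
    return True


def threshold_clusters(d, r=1.5):
    """Connected components after linking all pairs at distance <= r."""
    clusters = {frozenset([i]) for i in range(len(d))}
    for i, j in combinations(range(len(d)), 2):
        if d[i][j] <= r:
            ci = next(c for c in clusters if i in c)
            cj = next(c for c in clusters if j in c)
            clusters = (clusters - {ci, cj}) | {ci | cj}
    return clusters


def first_point_alone(d):
    """Depends on the labelling: point 0 is always split off from the rest."""
    return {frozenset({0}), frozenset(range(1, len(d)))}


xs = [0, 1, 5, 6]
d = [[abs(a - b) for b in xs] for a in xs]
print(invariant_on(threshold_clusters, d))  # True
print(invariant_on(first_point_alone, d))   # False
```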

When we think about clustering algorithms, we often identify the abstract set $S$ with a concrete set of points in the plane, or in some other ambient space, and imagine varying the metric on $S$ as moving the points of $S$ around. Indeed, this is the point of view we took in the illustrative example above. In this context, isometry invariance means that our clustering algorithm is insensitive to rotations, reflections and translations.


A variant of Kleinberg’s Theorem

The intuition given above is captured by the following variant of Kleinberg’s Theorem.

Theorem: There is no non-trivial isometry-invariant clustering algorithm that is simultaneously consistent and scale-invariant.

Here, by a trivial clustering algorithm, I mean one of the following two algorithms:

the algorithm that assigns to every metric on $S$ the discrete partition, in which every cluster consists of a single point,
the algorithm that assigns to every metric on $S$ the lump partition, consisting of a single cluster.

The claim is that these silly algorithms are the only two isometry invariant algorithms that are both consistent and scale-invariant.

Proof: Let $S$ be the finite set on which our algorithm $Γ$ is supposed to operate. Let $d₁$ be the metric on $S$ in which any pair of distinct points has unit distance (i.e. $d₁(x,y) = 1$ for all $x ≠ y$ in $S$). Every permutation of $S$ is an isometry from $d₁$ to itself, so, as $Γ$ is isometry invariant, there are only two possibilities for $Γ(d₁)$: either $Γ(d₁)$ is the discrete partition, or $Γ(d₁)$ is the lump partition. Let’s first look at the case when $Γ(d₁)$ is the discrete partition. Given any metric $d$ on $S$, we can rescale it so that all pairs of distinct points have distance $≥ 1$; by scale invariance, this does not change $Γ(d)$. Every pair of distinct points lies in different clusters of $Γ(d₁)$, and in passing from $d₁$ to the rescaled $d$ no distance decreases, so by consistency we find that $Γ(d) = Γ(d₁)$. So in this case $Γ$ is the trivial algorithm that assigns the discrete partition to every metric. Second, let’s consider the case that $Γ(d₁)$ is the lump partition. We can rescale any metric $d$ on $S$ so that all pairs of points have distance $≤ 1$, again without changing $Γ(d)$. Now every pair of points lies in the same cluster of $Γ(d₁)$ and no distance increases, so again consistency implies that $Γ(d)=Γ(d₁)$. So $Γ$ is also trivial in this case. ∎

Of course, this proof is very close in spirit to Margareta Ackerman’s proof of Kleinberg’s original theorem, discussed in Alex Williams’s answer.

ahwillia · Answer 2 · 2015-10-03T21:21:51.933


This is the intuition I came up with (a snippet from my blog post here).

A consequence of the richness axiom is that we can define two different distance functions, $d_1$ (top left) and $d_2$ (bottom left), that respectively put all the data points into individual clusters and into some other clustering. Then we can define a third distance function $d_3$ (top and bottom right) that simply scales $d_2$ so that the minimum distance between points in $d_3$ space is larger than the maximum distance in $d_1$ space. Then we arrive at a contradiction: by consistency the clustering should be the same for the $d_1 \rightarrow d_3$ transformation (every pair of points lies in different clusters under $d_1$, and all distances only grow), but by scale-invariance it should also be the same for the $d_2 \rightarrow d_3$ transformation.
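
The construction of $d_3$ can be checked numerically. This is a minimal sketch on three points; the distance matrices and the helper names are made up for illustration, and `is_gamma_transformation` encodes the condition from the consistency axiom (within-cluster distances do not grow, between-cluster distances do not shrink).

```python
def off_diagonal(d):
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(n) if i != j]


def scale(d, alpha):
    return [[alpha * v for v in row] for row in d]


def is_gamma_transformation(d_old, d_new, partition):
    """True if d_new only shrinks within-cluster and only grows between-cluster distances."""
    cluster_of = {i: c for c in partition for i in c}
    n = len(d_old)
    for i in range(n):
        for j in range(i + 1, n):
            same = cluster_of[i] == cluster_of[j]
            if same and d_new[i][j] > d_old[i][j]:
                return False
            if not same and d_new[i][j] < d_old[i][j]:
                return False
    return True


# d1 is assumed to be clustered into singletons, d2 into {0, 1} and {2}.
d1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
gamma1 = [{0}, {1}, {2}]
d2 = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
gamma2 = [{0, 1}, {2}]

# Scale d2 until its smallest distance exceeds the largest distance of d1.
alpha = max(off_diagonal(d1)) / min(off_diagonal(d2)) + 1
d3 = scale(d2, alpha)

print(min(off_diagonal(d3)) > max(off_diagonal(d1)))  # True
print(is_gamma_transformation(d1, d3, gamma1))        # True: consistency applies
print(is_gamma_transformation(d2, d3, gamma2))        # False: only a rescaling of d2
```

So consistency forces the clustering of $d_3$ to be the singletons, while $d_3$ is a pure rescaling of $d_2$, whose clustering is a different partition.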

edited Oct 03 '15 at 21:21

answered Oct 02 '15 at 17:21


Do you mean bottom left for d2? One nice thing about your diagram is that it shows how consistency isn't a generally desirable property (or that it's too loosely formulated). – xan Oct 02 '15 at 17:44
Yes bottom left, edited the answer accordingly. Thanks! – ahwillia Oct 03 '15 at 21:27
Before I fully understood your answer, I came up with logic that turns out to be the dual of yours: start with a clustering where all points are in the same cluster. Transform it into any other arrangement by shrinking it into a miniature version of any other arrangement and scaling it up to a full-size version of the other arrangement. – xan Oct 04 '15 at 17:51
