
I'm reading Bishop on the EM algorithm for GMMs and the relationship between GMMs and k-means.

The book says that k-means is a hard-assignment version of GMM. Does that imply that if the data I'm trying to cluster are not Gaussian, I can't use k-means (or at least that it's not suitable)? For example, what if the data are images of handwritten digits, each consisting of 8x8 pixels with value 0 or 1 (and assuming the pixels are independent, so it should be a mixture of Bernoullis)?

I'm a little confused about this and would appreciate any thoughts.

Has QUIT--Anony-Mousse
Eddie Xie
  • If you are asking whether it is valid to perform k-means clustering on non-normal data, the answer is yes if the data are assumed to be continuous. Binary data isn't continuous. Some people do k-means on such data, which is heuristically permissible but theoretically invalid. – ttnphns Sep 07 '13 at 11:12
  • There's no probability model for k-means so there's no normality assumption to invalidate. (doesn't mean it will work well though) – conjectures Sep 08 '13 at 19:34
  • @conjectures Hmm... But k-means is equivalent to GMM, and GMM assumes normality. – Eddie Xie Sep 09 '13 at 01:54
  • @ttnphns Thanks for your answer! So I guess if I use TF-IDF to transform text into scores and make it continuous, then I can apply k-means and it's valid? – Eddie Xie Sep 09 '13 at 01:56
  • I suddenly realized that a GMM is a mixture (sum) of several Gaussians, so it should be able to express pretty much any distribution given enough components. Thus, even if GMM and k-means are equivalent, that does not mean k-means can't be used on non-normal data, because a GMM can express whatever distribution. Is that correct? – Eddie Xie Sep 09 '13 at 03:29
  • I remember that there is a very good lecture by Prof. Pedro Domingos on Coursera, where he compares k-means, mixtures of Gaussians, and Bayes nets. It was of great help to me: https://class.coursera.org/machlearning-001/lecture/preview_view/137 – Daniele Jan 17 '14 at 17:14

2 Answers


In typical EM GMM situations, one does take variance and covariance into account. This is not done in k-means.

But indeed, one of the popular heuristics for k-means (note: k-means is a problem, not an algorithm) - the Lloyd algorithm - is essentially an EM algorithm, using a centroid model (without variance) and hard assignments.

When doing k-means style clustering (i.e. variance minimization), you

  • coincidentally minimize squared Euclidean distance, because a point's WCSS (within-cluster sum of squares) variance contribution is exactly its squared Euclidean distance to its cluster's centroid
  • coincidentally assign objects to the nearest cluster by Euclidean distance, because the sqrt function is monotone (note that the mean does not optimize Euclidean distances, but the WCSS function)
  • represent clusters using a centroid only
  • get Voronoi cell shaped clusters, i.e. polygons
  • work best with spherical clusters

The k-means objective function can be formalized as: $$\text{argmin}_S \sum_{i=1}^{k} \sum_{x_j \in S_i} \sum_{d=1}^{D} \left(x_{jd} - \mu_{id} \right)^2$$ where the minimum is taken over all partitionings $S=\{S_1, \ldots, S_k\}$ of the data set into $k$ clusters, $D$ is the data set dimensionality, $x_{jd}$ is the coordinate of the $j$th instance in dimension $d$, and $\mu_{id}$ is the mean of cluster $S_i$ in dimension $d$.
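
For concreteness, here is a minimal sketch of the Lloyd heuristic written directly against this objective, in plain NumPy. The function name `lloyd_kmeans` and its parameters are my own choices for illustration, not from any particular library.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, rng=None):
    """Hard-assignment EM on a centroid-only model, i.e. the Lloyd heuristic."""
    rng = np.random.default_rng(rng)
    # centroid model only: no variances, no covariances
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-like step: hard-assign each point to the centroid minimizing its
        # squared Euclidean distance (= its WCSS contribution)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-like step: each cluster mean minimizes that cluster's sum of squares
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # recompute the final assignment and the objective value (WCSS)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss
```

Under hard assignments, both steps can only decrease the WCSS, which is why the heuristic converges, though only to a local optimum.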

It is commonly said that k-means assumes spherical clusters. It is also commonly acknowledged that k-means clusters are Voronoi cells, i.e. not spherical. Both are correct, and both are wrong. First of all, the clusters are not complete Voronoi cells, but only the known objects therein. There is no need to consider the dead space in between the clusters to be part of either cluster, as having an object there would affect the algorithm's result. But it is not much better to call the clusters "spherical" either, just because the Euclidean distance is spherical. K-means doesn't care about Euclidean distance. All it is is a heuristic to minimize the variances. And that is actually what you should consider k-means to be: variance minimization.

Has QUIT--Anony-Mousse
  • Let me suggest that you refine some of your expressions a bit, for more accuracy. For instance, what does it mean to `minimize squared euclidean distance` or `minimize the variances`? There must be words like "sum of" or "pooled", because we have 2+ clusters, don't we? – ttnphns Sep 08 '13 at 09:26
  • BTW, since k-means minimizes the pooled within-cluster sum of d^2 _divided_ by the number of objects in the respective cluster, your point `coincidentally minimize Euclidean distance, because the sqrt function is monotone` is, to be precise, not correct. – ttnphns Sep 08 '13 at 09:47
  • The proper objective function, for which you can prove convergence, is WCSS, **within-cluster sum-of-squares**. And indeed, it doesn't minimize Euclidean distances, but the nearest-centroid-by-Euclidean-distance assignment is also the WCSS-optimal assignment. – Has QUIT--Anony-Mousse Sep 08 '13 at 14:40
  • Your wording remains unfortunately _dubious_. What does the phrase `minimize squared Euclidean distance, because WCSS variance contribution = squared euclidean distance` _mean_? Are you saying "squared d's _between the objects_ in clusters get minimized because WCSS of deviations get minimized", or just "WCSS of deviations get minimized, which - the deviations - _are_ euclidean distances by nature"? Or something else? – ttnphns Sep 08 '13 at 16:18
  • If you want the math part, use the **equation**. If you want the *intuition*, you have to live with ambiguity, and figure out the details yourself. I have yet to see a mathematically 100% correct term (including all preliminaries) that is intuitive. – Has QUIT--Anony-Mousse Sep 08 '13 at 18:31
  • I want neither. I just asked you to be precise in your wording and use words like "summed" or "pooled". I also expected you to reveal whether you realize or not, in your 1st point, the issue of the tie between the _pairwise squared distances_ and the _squared deviations from centroid_, - that _same_ issue that I put out in [this](http://stats.stackexchange.com/a/49517/3277) long discussion with you. – ttnphns Sep 08 '13 at 19:06
  • I have never heard anyone use "summed" or "pooled", so why would I start doing so here? I don't see why I need to adopt your terminology. – Has QUIT--Anony-Mousse Sep 08 '13 at 19:09
  • Obviously, k-means is a good choice only if you want a centroid model of your data. If you want to optimize pairwise distances, use hierarchical clustering. – Has QUIT--Anony-Mousse Sep 08 '13 at 19:10
  • I see that you are again reluctant to admit your lacuna. OK, I won't continue. But I'll **repeat** it anyway, that the sum of squared deviations from centroid is _exactly_ related to the sum of the squared pairwise euclidean distances. So, k-means _does_ try to optimize pairwise distances. – ttnphns Sep 08 '13 at 19:20
  • This answer doesn't really address the title of the question?? – SmallChess Feb 20 '17 at 10:34
  • @StudentT it answers implicitly: k-means cannot even cluster all normal data sets. – Has QUIT--Anony-Mousse Feb 20 '17 at 19:32

GMM uses overlapping hills that stretch to infinity (but in practice only matter out to about 3 sigma). Each point gets all the hills' probability scores. Also, the hills are "egg-shaped" [okay, they're symmetric ellipses] and, using the full covariance matrix, may be tilted.

K-means hard-assigns a point to a single cluster, so the scores of the other cluster centers get ignored (are implicitly reset to zero/don't care). The hills are spherical soap bubbles. Where two soap bubbles touch, the boundary between them becomes a flat (hyper-)plane. Just as when you blow a foam of many soap bubbles, the bubbles on the inside are not flat but boxy, so the boundaries between many (hyper-)spheres actually form a Voronoi partition of the space. In 2D, this tends to look vaguely like hexagonal close-packing, think of a beehive (although of course Voronoi cells are not guaranteed to be hexagons). A K-means hill is round and does not get tilted, so it has less representation power; but it is much faster to compute, especially in higher dimensions.
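
To make the contrast concrete, here is a small sketch (assuming scikit-learn and NumPy are available; the data is synthetic and purely for illustration) that fits both models to the same tilted, elongated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two elongated, tilted Gaussian blobs
tilt = np.array([[3.0, 0.0], [2.0, 0.5]])
A = rng.normal(size=(300, 2)) @ tilt + [0, 0]
B = rng.normal(size=(300, 2)) @ tilt + [4, 4]
X = np.vstack([A, B])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(km.cluster_centers_)   # centroids only: no shape information
print(gmm.means_)            # means ...
print(gmm.covariances_)      # ... plus a full covariance (tilt/elongation) per component
```

The GMM's covariance matrices carry the tilt and elongation that the k-means centroid model has no way to represent.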

Because K-means uses the Euclidean distance metric, it assumes that the dimensions are comparable and of equal weight. So if dimension X has units of miles per hour, varying from 0 to 80, and dimension Y has units of pounds, varying from 0 to 400, and you're fitting circles in this XY space, then one dimension (and its spread) is going to be more powerful than the other dimension and will overshadow the results. This is why it's customary to normalize the data before running K-means.
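
A minimal sketch of that normalization step, assuming scikit-learn; the speed/weight columns just mirror the mph/pounds example above and are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
speed_mph = rng.uniform(0, 80, size=200)    # range 0-80
weight_lb = rng.uniform(0, 400, size=200)   # range 0-400: would dominate the distances
X = np.column_stack([speed_mph, weight_lb])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per axis
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```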

Both GMM and K-means model the data by fitting best approximations to what's given. GMM fits tilted eggs, and K-means fits untilted spheres. But the underlying data could be shaped like anything, it could be a spiral or a Picasso painting, and each algorithm would still run, and take its best shot. Whether the resulting model looks anything like the actual data depends on the underlying physical process generating the data. (For instance, time-delay measurements are one-sided; is a Gaussian a good fit? Maybe.)

However, both GMM and K-means implicitly assume data axes/domains coming from the field of real numbers $R^n$. This matters based on what kind of data axis/domain you are trying to cluster. Ordered integer counts map nicely onto reals. Ordered symbols, such as colors in a spectrum, not so nicely. Binary symbols, ehn. Unordered symbols do not map onto reals at all (unless you're using creative new mathematics since 2000).

Thus each of your 8x8 binary images is going to be construed as a point on a 64-dimensional hypercube sitting in the first hyperquadrant. The algorithms then use geometric analogies to find clusters. Distance, with K-means, shows up as Euclidean distance in 64-dimensional space. It's one way to do it.
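
Here is a small sketch of that 64-dimensional view, using scikit-learn's `load_digits` (which is 8x8 grayscale with values 0-16, so the binarization threshold below is an assumption made only to match the question's 0/1 pixels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()                              # 1797 images, each 8x8
X = digits.images.reshape(len(digits.images), 64)   # each image -> a 64-dim vector
X = (X > 7).astype(float)                           # assumed threshold -> 0/1 corners of the hypercube
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
```

K-means will happily run on these corner points; whether Euclidean distance between binary vectors is the dissimilarity you actually want is the issue raised in the comments above.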

DragonLord
  • Note both algorithms also implicitly assume space axes are equally dense at all points, so fitting exponentially, logarithmically, or sinusoidally-varying data typically benefits from a pre-transform to remap the data into an approximately-linearly-varying domain. – DragonLord Mar 24 '17 at 16:27