Considering how popular $k$-means is, yes, it's definitely run on clusters and most decent machine learning libraries have an implementation. Its per-iteration complexity grows linearly in $k$, the number of data points, and the number of dimensions, so it becomes infeasible in high-dimensional spaces with many points. Because there aren't really any convolutions involved, I'm not sure how much of an advantage you'd get from running it on a GPU.
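To make the scaling concrete, here's a minimal NumPy sketch of a single Lloyd iteration (names and sizes are my own illustration, not from any particular library). The distance computation alone touches every point–centroid–dimension triple, which is where the linear cost in $n$, $k$, and $d$ per iteration comes from.

```python
import numpy as np

def lloyd_iteration(X, centroids):
    """One k-means (Lloyd) iteration. The pairwise-distance step alone
    costs O(n * k * d), which is the source of the linear scaling."""
    # (n, k) squared distances between every point and every centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
    new_centroids = np.vstack([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(centroids.shape[0])
    ])
    return labels, new_centroids

# toy run: n=1000 points, d=10 dimensions, k=5 clusters (made-up numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
centroids = X[rng.choice(len(X), size=5, replace=False)]
labels, centroids = lloyd_iteration(X, centroids)
```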
As an example, if you try running $k$-means on the entire Wikipedia corpus using word features (i.e., each document represented as a vector of words from a preset dictionary), it takes around a day on a small cluster running Spark.
Something like spherical $k$-means is used extensively in compressed sensing. The only difference there is that the vectors are normalized to unit length and cosine similarity is used as the error measure. Such compressions might easily occur in real-time applications, like transmission of data over long distances, so speed would be highly valued.
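A rough sketch of that variant (my own illustration, not a reference implementation): because points and centroids both sit on the unit sphere, cosine similarity reduces to a dot product, and the centroid update just re-projects the mean direction back onto the sphere.

```python
import numpy as np

def spherical_kmeans_step(X_unit, centroids):
    """One spherical k-means step. X_unit and centroids are L2-normalized,
    so cosine similarity is a plain dot product."""
    sims = X_unit @ centroids.T              # (n, k) cosine similarities
    labels = sims.argmax(axis=1)             # assign to the most similar centroid
    new_centroids = np.vstack([
        X_unit[labels == j].sum(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(centroids.shape[0])
    ])
    # project centroids back onto the unit sphere
    norms = np.linalg.norm(new_centroids, axis=1, keepdims=True)
    return labels, new_centroids / np.clip(norms, 1e-12, None)
```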
As far as applications go, any sort of real-time $k$-means with a short memory will be a huge bottleneck. As an example, imagine using $k$-means to study topic trends on Twitter over the course of a day. Depending on the day's events, the clusters will move around significantly, and you're getting a huge stream of points at any given time.
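One common way around that bottleneck is an incremental/mini-batch variant. Here's a minimal sketch using scikit-learn's `MiniBatchKMeans`; the tweet vectorization step is hypothetical, so random vectors stand in for the feature extraction.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# hypothetical setup: each incoming tweet is already encoded as a 100-dim vector
km = MiniBatchKMeans(n_clusters=20, random_state=0)

rng = np.random.default_rng(0)
for _ in range(1000):                      # stand-in for a day's worth of batches
    batch = rng.normal(size=(256, 100))    # placeholder for a batch of tweet vectors
    km.partial_fit(batch)                  # update centroids incrementally

print(km.cluster_centers_.shape)           # (20, 100)
```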
As well, $k$-means is rarely run just once (though this depends on the implementation), since its output depends on the random initialization. That overhead can be a pain when you're dealing with a ton of points.
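Concretely, the usual workaround is several random restarts, keeping the run with the lowest within-cluster sum of squares, which multiplies the cost. A quick sketch on made-up data (scikit-learn exposes the same idea directly through the `n_init` parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))          # made-up data for illustration

# manual restarts: run k-means several times, keep the best objective (inertia)
best = None
for seed in range(10):
    km = KMeans(n_clusters=8, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km

# equivalent built-in form: KMeans(n_clusters=8, n_init=10) does the restarts for you
```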