Considering how popular $k$-means is, yes, it's definitely run on clusters and most decent machine learning libraries have an implementation. Its per-iteration complexity grows linearly in $k$, the number of data points, and the number of dimensions, so it becomes infeasible in high-dimensional spaces with many points. Because there aren't really any convolutions involved, I'm not sure how much of an advantage you'd get from running it on a GPU.
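To make the scaling concrete, here's a minimal NumPy sketch of a single Lloyd iteration (names and sizes are my own illustration, not from any particular library). The distance computation alone touches every point–centroid–dimension triple, which is where the linear cost in $n$, $k$, and $d$ per iteration comes from.

```python
import numpy as np

def lloyd_iteration(X, centroids):
    """One k-means (Lloyd) iteration. The pairwise-distance step alone
    costs O(n * k * d), which is the source of the linear scaling."""
    # (n, k) squared distances between every point and every centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
    new_centroids = np.vstack([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(centroids.shape[0])
    ])
    return labels, new_centroids

# toy run: n=1000 points, d=10 dimensions, k=5 clusters (made-up numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
centroids = X[rng.choice(len(X), size=5, replace=False)]
labels, centroids = lloyd_iteration(X, centroids)
```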
As an example, if you try running $k$-means on the entire Wikipedia corpus using word features (i.e., each document represented as a vector of words from a preset dictionary), it takes around a day on a small cluster running Spark.
Something like spherical $k$-means is used extensively in compressed sensing. The only difference there is that the vectors are normalized to unit length and cosine similarity is used as the error measure. Such compressions might easily occur in real-time applications, like transmission of data over long distances, so speed would be highly valued.
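A rough sketch of that variant (my own illustration, not a reference implementation): because points and centroids both sit on the unit sphere, cosine similarity reduces to a dot product, and the centroid update just re-projects the mean direction back onto the sphere.

```python
import numpy as np

def spherical_kmeans_step(X_unit, centroids):
    """One spherical k-means step. X_unit and centroids are L2-normalized,
    so cosine similarity is a plain dot product."""
    sims = X_unit @ centroids.T              # (n, k) cosine similarities
    labels = sims.argmax(axis=1)             # assign to the most similar centroid
    new_centroids = np.vstack([
        X_unit[labels == j].sum(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(centroids.shape[0])
    ])
    # project centroids back onto the unit sphere
    norms = np.linalg.norm(new_centroids, axis=1, keepdims=True)
    return labels, new_centroids / np.clip(norms, 1e-12, None)
```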
As far as applications go, any sort of real-time $k$-means with a short memory will be a huge bottleneck. As an example, imagine using $k$-means to study topic trends on Twitter over the course of a day. Depending on the day's events, the clusters will move around significantly, and you're getting a huge stream of points at any given time.
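One common way around that bottleneck is an incremental/mini-batch variant. Here's a minimal sketch using scikit-learn's `MiniBatchKMeans`; the tweet vectorization step is hypothetical, so random vectors stand in for the feature extraction.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# hypothetical setup: each incoming tweet is already encoded as a 100-dim vector
km = MiniBatchKMeans(n_clusters=20, random_state=0)

rng = np.random.default_rng(0)
for _ in range(1000):                      # stand-in for a day's worth of batches
    batch = rng.normal(size=(256, 100))    # placeholder for a batch of tweet vectors
    km.partial_fit(batch)                  # update centroids incrementally

print(km.cluster_centers_.shape)           # (20, 100)
```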
As well, $k$-means is rarely run just once (though this depends on the implementation), since its output depends on the random initialization. That overhead can be a pain when you're dealing with a ton of points.
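Concretely, the usual workaround is several random restarts, keeping the run with the lowest within-cluster sum of squares, which multiplies the cost. A quick sketch on made-up data (scikit-learn exposes the same idea directly through the `n_init` parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))          # made-up data for illustration

# manual restarts: run k-means several times, keep the best objective (inertia)
best = None
for seed in range(10):
    km = KMeans(n_clusters=8, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km

# equivalent built-in form: KMeans(n_clusters=8, n_init=10) does the restarts for you
```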