While doing exploratory K-Means clustering on agents with varying numbers of events, I built two sets of models for K in {2, ..., 9}. In one set, each model is fit on raw counts of five kinds of events for a given agent, three of which are mutually exclusive; in the other set, the features are four percentages of an agent's total events (including two of the three mutually exclusive kinds). Both feature sets are MinMax-scaled to the range [-1, 1].
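For concreteness, here is a sketch of the feature preparation. Everything here is a hypothetical stand-in: the data is synthetic, the event-type mix and column choices are invented, and scikit-learn's MinMaxScaler is used locally in place of Spark's scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical raw counts for 1000 agents across five event types
# (stand-in for the real agent data).
counts = rng.poisson(lam=[5, 3, 2, 4, 1], size=(1000, 5)).astype(float)

# Guard against any agent with zero total events.
totals = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

# Percentage features: four of the five event types as a share of
# the agent's total events.
pcts = counts[:, [0, 1, 3, 4]] / totals

# Both feature sets MinMax-scaled to [-1, 1], mirroring the question's setup.
counts_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(counts)
pcts_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(pcts)
```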
I fit the models using PySpark's implementation of K-Means and was surprised to find that the inertia (WSSSE), calculated with the computeCost method, was two orders of magnitude higher for every cluster solution built on percentages than for the corresponding solution built on counts. I had expected little difference, since both feature sets use the same scaling, but it is almost as if the clusters built on percentages are somehow more diffuse than those built on counts.
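The fitting loop looks roughly like this, sketched locally with scikit-learn's KMeans (its inertia_ attribute plays the role of Spark's computeCost). The data is a synthetic stand-in, so the magnitude gap may or may not reproduce here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data, prepared as described in the question.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[5, 3, 2, 4, 1], size=(1000, 5)).astype(float)
totals = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
pcts = counts[:, [0, 1, 3, 4]] / totals

counts_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(counts)
pcts_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(pcts)

# Fit one model per K on each feature set and compare inertia (WSSSE).
for k in range(2, 10):
    km_counts = KMeans(n_clusters=k, n_init=10, random_state=0).fit(counts_scaled)
    km_pcts = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcts_scaled)
    print(k, round(km_counts.inertia_, 2), round(km_pcts.inertia_, 2))
```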
What should I look for to understand why the inertia is so much higher when the model is fitted to percentages rather than to counts?