Consider a data set like this:
|         | Attribute1 | Attribute2 | Attribute3 |
|---------|-----------:|-----------:|-----------:|
| Obs_1   | 1,00       | 10,00      | 4500,00    |
| Obs_2   | 1,10       | 80,00      | 3200,00    |
| Obs_3   | 0,90       | 137,00     | 95000,00   |
| …       | …          | …          | …          |
| Obs_100 | 1,05       | -65,00     | 17000,00   |
As you can see, the attributes have very different ranges; furthermore, their summary statistics would likely reveal very different distributions.
Now suppose you want to apply clustering via K-means: how can this affect the final result? If you imagine the data points in a 3D space, the coordinates belonging to Attribute1 are massed along the $x$-axis, while those belonging to Attribute3 are likely strewn along the $z$-axis because of its much larger scale.
My guess is that this overweights Attribute3 when it comes to minimizing the Euclidean distances used to find the cluster means: K-means would find that nearly every Obs_i falls in the same cluster according to Attribute1, and would therefore conclude that Attribute3 is the real discriminant along which clusters should be formed.
My question: if my guess is correct, does this mean that some form of standardization should be applied before K-means? The standard score (z-score) seems a good choice.
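To make the question concrete, here is a small sketch (using synthetic data, not the table above, with scikit-learn's `KMeans` and `StandardScaler`) of what I mean: a feature on a small scale carries the true two-group structure, while a large-scale feature is pure noise, and I compare K-means with and without z-score standardization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 50

# Attribute1-like feature: small scale, but carries the real group
# structure (two tight groups around 1.0 and 5.0).
a1 = np.concatenate([rng.normal(1.0, 0.05, n), rng.normal(5.0, 0.05, n)])
# Attribute3-like feature: huge scale, identical distribution for both
# groups, i.e. no clustering information at all.
a3 = rng.normal(20000.0, 8000.0, 2 * n)

X = np.column_stack([a1, a3])
true_labels = np.array([0] * n + [1] * n)

def agreement(labels, truth):
    """Fraction of points labeled consistently with the truth,
    up to swapping the two cluster labels."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

print(f"raw agreement:          {agreement(raw, true_labels):.2f}")
print(f"standardized agreement: {agreement(std, true_labels):.2f}")
```

On the raw data, the Euclidean distance is dominated by the large-scale noise feature, so the clusters found are essentially arbitrary; after standardization, both features contribute on the same scale and the true groups are recovered.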