
Consider a data set like the following:

        |  Attribute1   Attribute2   Attribute3
-----------------------------------------------
Obs_1   |        1.00        10.00      4500.00
Obs_2   |        1.10        80.00      3200.00
Obs_3   |        0.90       137.00     95000.00
…       |           …            …            …
Obs_100 |        1.05       -65.00     17000.00

As you can see, the attributes have very different ranges; moreover, their summary statistics would likely reveal very different distributions.

Now suppose you want to apply clustering via K-means: how can this affect the final result? If you imagine the data points in a 3D space, the coordinates belonging to Attribute1 are massed along the $x$-axis, while those belonging to Attribute3 are likely to be strewn along the $z$-axis because of their very different scale.

My guess is that this overweights Attribute3 when it comes to minimizing Euclidean distances to find the cluster means: I suspect K-means finds that about every Obs_i falls into the same cluster according to Attribute1, and therefore concludes that Attribute3 is the real discriminant to use for creating the clusters.
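To see why the largest-scale attribute dominates, here is a minimal sketch (using the first two rows of the table above as example points) that breaks the squared Euclidean distance down per attribute:

```python
# Two observations on the scale of the table above (Obs_1 and Obs_2)
a = [1.00, 10.0, 4500.0]
b = [1.10, 80.0, 3200.0]

# Per-attribute squared differences that sum to the squared Euclidean distance
sq = [(x - y) ** 2 for x, y in zip(a, b)]
total = sum(sq)
for name, s in zip(["Attribute1", "Attribute2", "Attribute3"], sq):
    print(f"{name}: {s / total:.2%} of squared distance")
```

Attribute3 accounts for roughly 99.7% of the squared distance here, so the cluster assignments are driven almost entirely by that one attribute.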

My question: if my guess is correct, does this mean that some form of standardization should be applied before K-means? The standard score (z-score) seems a good choice.
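For concreteness, a minimal pure-Python sketch of z-score standardization (the function name `zscore_columns` is my own; real code would typically use something like scikit-learn's `StandardScaler` instead):

```python
import math

def zscore_columns(rows):
    """Standardize each column to mean 0 and (population) std 1."""
    cols = list(zip(*rows))
    out_cols = []
    for col in cols:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out_cols.append([(v - mean) / std for v in col])
    # Transpose back to row-major form
    return [list(r) for r in zip(*out_cols)]

# The four visible rows from the table above
rows = [[1.00, 10.0, 4500.0],
        [1.10, 80.0, 3200.0],
        [0.90, 137.0, 95000.0],
        [1.05, -65.0, 17000.0]]
standardized = zscore_columns(rows)
```

After this transformation every attribute has the same unit variance, so no single attribute dominates the Euclidean distances that K-means minimizes.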

Lisa Ann
