k-means clustering on percentages

Question

Can we run k-means clustering on percentage data (e.g., 56%, 44%, 22%, 13%)?
I have a data set in which various parts of the data are measured in percentages.

You can apply k-means to any data as long as your covariates are numeric. Is it a good idea? We can't tell simply by examining the data range (i.e., percentages); data visualization, domain knowledge, and testing are imperative. - Ramalho, Oct 17 '14 at 11:20

score 1 · Answer 1 · answered Jul 22 '14 at 13:39

I don't see any reason not to. Percentage values are just ordinary numbers, each divided by some total.

If other parts of the data are not percentages, you may have to scale the data appropriately (or convert them to percentages as well), or choose the distance measure carefully.
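To illustrate why scaling matters when only some columns are percentages, here is a minimal sketch using NumPy only. The four rows and the column meanings (a percentage and an income in USD) are made up for illustration, and standardizing to unit variance is just one of several reasonable scalings.

```python
import numpy as np

# Hypothetical rows: [share of budget spent on rent (%), annual income (USD)]
X = np.array([
    [20.0, 40000.0],   # A
    [80.0, 40000.0],   # B: differs from A only in the percentage
    [20.0, 45000.0],   # C: differs from A only in income
    [80.0, 45000.0],   # D
])

def pairwise_distances(M):
    """Euclidean distance between every pair of rows of M."""
    return np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)

raw = pairwise_distances(X)

# Standardize each column to zero mean and unit (population) standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = pairwise_distances(Z)

print("raw    A-B, A-C:", raw[0, 1], raw[0, 2])
print("scaled A-B, A-C:", scaled[0, 1], scaled[0, 2])
```

On the raw data the A-C distance (income only) is far larger than the A-B distance (percentage only), so a Euclidean method such as k-means would mostly ignore the percentage column. After standardizing, the two differences count equally.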

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

Because many clustering algorithms (very much including k-means) are thrown off by data in which the variables have different ranges (cf. this excellent CV thread: Why does gap statistic for k-means suggest one cluster, even though there are obviously two of them?), it is very common to normalize all variables first (i.e., convert the range to $[0,\ 1]$, see here). As a result, it is common to run k-means on data where all variables are, in essence, expressed as percentages.
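As a rough sketch of the workflow described in this answer, the code below rescales each column to $[0,\ 1]$ and then runs a plain Lloyd's-algorithm k-means written in NumPy. The data, the helper names `min_max_scale` and `kmeans`, and the random initialization are assumptions made for the example; in practice a library implementation would normally be used instead.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - lo) / span

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical rows: [time spent online (%), monthly spend (USD)]
X = np.array([
    [10, 200], [12, 220], [11, 210],
    [88, 900], [90, 950], [89, 920],
])

X01 = min_max_scale(X)
labels, centroids = kmeans(X01, k=2)
print(labels)
```

Which group receives label 0 depends on the initialization, so only the grouping itself is meaningful.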

score 0 · Answer 3 · answered Aug 02 '18 at 05:11

I think there's no theoretical issue with it, but my gut tells me to avoid including a set of variables that together account for 100% of a given item. For example, if you're measuring individuals' use of time and clustering across individuals to differentiate Couch Potatoes from Stats Fans from Hiking Enthusiasts, and you have five categories of time use (sleep, eat, exercise, work, play), don't include all five as percentages. In linear regression this causes perfect multicollinearity, because the five shares always sum to 100%. In a clustering problem it seems like the redundancy would also distort the result.
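The redundancy this answer is concerned about can be checked numerically. The sketch below uses made-up time-use shares for six people and shows that, once all five shares sum to 100%, the fifth column adds no independent information. It does not by itself show how the clustering result would be affected.

```python
import numpy as np

# Made-up time-use shares (%) per person: sleep, eat, exercise, work, play.
shares = np.array([
    [33,  8,  4, 35, 20],
    [30, 10,  2, 30, 28],
    [35,  6, 12, 33, 14],
    [31,  9,  1, 25, 34],
    [34,  7, 15, 30, 14],
    [29,  8,  3, 40, 20],
], dtype=float)

# Every row accounts for 100% of the person's time.
row_sums = shares.sum(axis=1)

# After centering, five columns span only four independent directions.
centered = shares - shares.mean(axis=0)
rank_all_five = np.linalg.matrix_rank(centered, tol=1e-8)
rank_first_four = np.linalg.matrix_rank(centered[:, :4], tol=1e-8)

# The dropped column ("play") can be rebuilt exactly from the other four.
play_rebuilt = 100.0 - shares[:, :4].sum(axis=1)

print(rank_all_five, rank_first_four)
print(np.allclose(play_rebuilt, shares[:, 4]))
```

Dropping any one of the five shares therefore loses nothing that the remaining four do not already encode.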
