
I have 10 variables, some of which are highly correlated. Before running k-means, I want to reduce them to a smaller number of uncorrelated variables while retaining as much information as possible, so I decided to run PCA first. However, one variable is far from normal: it has many zeros and appears to follow a gamma distribution, which makes it hard to transform adequately. On the other hand, this variable is not correlated with any of the other variables.

So the question is: is it valid to run PCA on all variables except the one that is uncorrelated and non-normal, then put the principal components and that variable into one data frame, scale and centre it, and run k-means on that data frame?
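To make the proposed workflow concrete, here is a minimal NumPy sketch of what I have in mind. Everything in it is an assumption for illustration only: the data are synthetic stand-ins (9 correlated columns plus one zero-inflated gamma-like column), the 90% explained-variance cutoff and `k=3` are arbitrary, and `standardize` and `kmeans` are hypothetical helpers written for this example rather than library functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in: 9 correlated variables driven by 3 latent factors (assumption).
latent = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, 9))
X_corr = latent @ loadings + 0.3 * rng.normal(size=(n, 9))

# 10th variable: many zeros, gamma-like otherwise, independent of the rest (assumption).
skewed = rng.gamma(shape=2.0, scale=1.5, size=n) * (rng.random(n) > 0.4)


def standardize(A):
    """Centre each column and scale it to unit sample variance."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)


# Step 1: PCA on the correlated block only, via SVD of the standardized data.
Z = standardize(X_corr)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)
# Smallest number of components whose cumulative explained variance reaches 90%.
n_comp = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = Z @ Vt[:n_comp].T

# Step 2: put the component scores and the left-out variable together,
# then centre and scale the combined table.
features = standardize(np.column_stack([scores, skewed]))


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centres (illustrative only)."""
    local_rng = np.random.default_rng(seed)
    centers = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Step 3: k-means on the combined, standardized features.
labels, centers = kmeans(features, k=3)
print(n_comp, features.shape, np.bincount(labels, minlength=3))
```

One side effect of the last scaling step is worth noting: rescaling the component scores to unit variance gives every retained component the same weight in the distance calculation, regardless of how much variance it explained.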

  • 1) https://stats.stackexchange.com/q/112277/3277 K-means clustering does not require uncorrelated variables. (Few of the other clustering methods do prefer uncorrelated variables.) 2) 10 variables isn't too many from the 'curse of dimensionality' perspective, so PCA is generally not needed; you might lose some information if you do PCA. What other considerations incline you towards doing PCA? Please put them forward. – ttnphns Jun 03 '17 at 10:29
  • The question of a skewed variable in k-means is two-fold. It is true that k-means assumes approximately symmetrically distributed clusters. On the other hand, a skewed variable may sometimes imply that there is one big (populated) cluster at one edge and a number of small clusters, with all clusters more or less symmetric inside. – ttnphns Jun 03 '17 at 10:34
