
We have 16 variables, all of which are indices computed from ratios (so they are unitless). Some example ranges of our variables are (0.450–0.750), (0.000–0.800) and (0.000–1.000). We want to apply hierarchical and K-means clustering to these data. According to the literature, it is recommended to standardize before PCA and then cluster on the result. In our case, covariance-based PCA has been proposed, but we are not sure whether we should standardize before this step.

If you could help us with this issue, we would be glad.

Thanks in advance for your answers.

user2067
  • Clustering algorithms such as k-means and OLS PCA are both sensitive to redundancy in the features and to scale, where "scale" refers both to the measurement type (e.g., ordinal, interval or ratio) and to the standard deviations of the features. PCA addresses the former issue, while some sort of feature transformation is recommended for the latter. This transformation can take many forms: dividing by the range, ipsative rescaling (e.g., dividing by the maximum value of a series), dividing by the IQR, Box–Cox transforms, standardizing to a mean of zero and a std dev of one, etc. – Mike Hunter Apr 04 '16 at 19:03
  • Why are you thinking of doing PCA first? You have 16 features; that is not at all many, so there is no need to seriously worry about the "curse of dimensionality". Doing PCA and dropping some of the last components risks losing information that is important for the clustering. But in clustering, the standardization issue should of course be considered. – ttnphns Apr 05 '16 at 08:13
  • @DJohnson What does OLS have to do with PCA? – Nick Cox Apr 05 '16 at 08:20
  • @NickCox Right! I think we've been down this road. Apologies for forgetting that thread... – Mike Hunter Apr 05 '16 at 09:55
  • @DJohnson I don't recall which thread you're referring to, but the same puzzling mention is likely to elicit the same puzzled comment. My answer is that plain or standard PCA is just a transformation; there is no estimation and OLS is not entailed. What's yours? – Nick Cox Apr 05 '16 at 10:08
  • OP: If the units of measurement are genuinely comparable, then standardization of any kind may not be a good idea, either before the PCA or during. In other words, it may be that covariance-based PCA on the raw data is what is best. We can't tell on your information: what to do depends on the precise definitions of your variables and your goals. – Nick Cox Apr 05 '16 at 10:12
  • @NickCox No worries. I do remember it and you convinced me that you were correct about PCA...I was wrong. However, I do disagree about encouraging the OP to focus solely on the "units of measurement" to make a determination about a transformation. In my experience, that's less important than the variance of the predictors, even for predictors with the same unit. If that cross feature variance is large, then the PCA will be weighted towards (or distorted by) the larger variance features. This strongly suggests some normalizing (stdzing) transformation. – Mike Hunter Apr 05 '16 at 10:36
  • @DJohnson Absolutely nothing in my comments states or implies that units of measurement are the _sole_ determinant of what is to be done. I used the word "may" twice; there is no "must" or "should". I suggest only that there might, so far as we can tell, be a case for no standardisation at all. It might be entirely right that variables that are nearly constant have the slight effect that they would. (In my experience, PCA of a mishmash of variables is often disappointing, regardless of standardization, but that is a different argument.) I also flagged that precise goals are crucial. – Nick Cox Apr 05 '16 at 11:07
  • @NickCox Fair enough...I didn't read your comment closely enough. – Mike Hunter Apr 05 '16 at 11:19
  • @DJohnson OK. The exchange raises some important general issues. Unless and until the OP fleshes out their question I doubt there's much scope for, or incentive for providing, further answers. – Nick Cox Apr 05 '16 at 11:22
  • Thanks a lot for the answers. In addition, I forgot to mention that most of our variables are highly correlated and their variances range from 0.002 (min) to 0.036 (max). In that case, which is the best option among the following: 1. Covariance-based PCA on raw data and then clustering of the PCA variables without extra standardization, 2. Clustering based on raw data (without PCA), 3. Clustering based on standardized data (without PCA). – user2067 Apr 05 '16 at 12:23
  • Thanks for the further detail, but the question is still similar to which car or television or life-partner you should choose out of three (and why those three and no others). We can't see your data and the detail you have given doesn't pin down how well any method will work, especially in relation to your unstated _scientific_ goals. – Nick Cox Apr 05 '16 at 20:06
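The scale sensitivity discussed in these comments can be sketched numerically. With two independent synthetic features whose variances differ by an order of magnitude (roughly the 0.002–0.036 range mentioned above), covariance-based PCA loads its first component almost entirely on the high-variance feature, while correlation-based PCA (equivalent to standardizing first) does not. A minimal NumPy sketch on made-up data, not the actual variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two independent synthetic unitless indices with very different variances,
# mimicking the 0.002-0.036 variance range mentioned in the comments.
x1 = rng.normal(0.6, np.sqrt(0.036), n)   # high-variance index
x2 = rng.normal(0.4, np.sqrt(0.002), n)   # low-variance index
X = np.column_stack([x1, x2])

# Covariance-based PCA: eigenvectors of the covariance matrix.
cov_vals, cov_vecs = np.linalg.eigh(np.cov(X, rowvar=False))
# Correlation-based PCA: equivalent to standardizing each column first.
cor_vals, cor_vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))

# eigh returns eigenvalues in ascending order, so the last column is the
# leading component. The covariance PC is dominated by the high-variance
# feature; the correlation PC weights both features equally.
print(np.abs(cov_vecs[:, -1]))   # close to [1, 0]
print(np.abs(cor_vecs[:, -1]))   # close to [0.71, 0.71]
```

Nothing here says which behavior is *right*; it only makes visible what the choice of covariance versus correlation does to the loadings.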

1 Answer


Which literature recommends standardization before PCA on such data? I've never seen this being recommended.

Essentially, PCA is a multivariate standardization, so there is some redundancy in first standardizing every attribute on its own and then doing PCA.
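The overlap between the two steps can be checked directly: standardizing each column and then computing the covariance matrix yields exactly the correlation matrix of the raw data, so covariance-based PCA on standardized data is the same as correlation-based PCA on the raw data. A small NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 3 features with arbitrary scales.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))

# Standardize each column to mean 0 and unit (sample) std dev ...
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# ... then the covariance matrix of Z equals the correlation matrix of X,
# so both PCAs diagonalize the same matrix.
print(np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False)))  # True
```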

As with any form of normalization, it may help, and it may harm. Correlations in your data can be harmful or helpful. If the variables are correlated because they originate from the same (good) signal, that's great! But if they are correlated because of some root cause that is not helpful, it can be better to reduce the correlation.
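One way to see the "may help, may harm" point is a synthetic example (scikit-learn assumed available) where the cluster signal lives in a low-variance feature: k-means on the raw data chases the high-variance but uninformative feature, keeping only the first covariance-PCA component drops the signal entirely, and standardizing recovers it. On data where the high-variance feature *is* the signal, the ranking would reverse, which is exactly why no option is automatically best:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 400)          # true cluster membership
X = np.column_stack([
    rng.normal(0.6, 0.19, 400),           # high variance, no cluster signal
    rng.normal(0.35 + 0.1 * labels, 0.01) # low variance, carries the signal
])

results = {}
for name, data in [
    ("raw", X),
    ("first covariance PC", PCA(n_components=1).fit_transform(X)),
    ("standardized", StandardScaler().fit_transform(X)),
]:
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # Adjusted Rand Index: 1 = perfect recovery of the true clusters, ~0 = chance.
    results[name] = adjusted_rand_score(labels, pred)
    print(f"{name}: ARI = {results[name]:.2f}")
```

Here standardization rescues the clustering; with the signal placed in the high-variance feature instead, standardizing (or dropping components) would hurt it.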

Has QUIT--Anony-Mousse
  • In addition, I forgot to mention that most of our variables are highly correlated and their variances range from 0.002 (min) to 0.036 (max). In that case, which is the best option among the following: 1. Covariance-based PCA on raw data and then clustering of the PCA variables without extra standardization, 2. Clustering based on raw data (without PCA), 3. Clustering based on standardized data (without PCA). – user2067 Apr 05 '16 at 15:15
  • There is no guarantee that removing correlation improves results. If the correlation is your signal, then you want to keep it. If it is not interesting, you may want to remove it. **Any of the three can be "best".** – Has QUIT--Anony-Mousse Apr 05 '16 at 19:22