
I have some data that I want to classify. As an initial step, I ran PCA on the data and saw two distinct clusters. However, when I standardize the data first, the two clusters disappear. What can this mean - that the classes are easily separated by the individual variances or means of the features? If that is the case, how should I approach classification?

Thanks.

Edit: Following the commenters' requests, I am adding an image of my clustered data. Since the PCA projection has more than 3 dimensions, it is hard to see from this image alone why classification succeeds. Also, the colors show the TRUE labels, not estimated ones.

yoki
  • Both PCA results and clustering (K-means, in particular) results are sensitive to standardization of variables. Whether you do PCA/clustering in conjunction or independently, you should first think it over - to standardize or not - and then go ahead with your decision. – ttnphns Nov 09 '14 at 00:19
  • Can you share a *visualization* of the effects you have been observing? Solving your problem from just two sentences of description is hard; it would really benefit from a visualization. – Has QUIT--Anony-Mousse Nov 09 '14 at 15:32
  • Also, "classification" is not "clustering". The first is supervised, the second is un-supervised. Please clarify what you are actually doing. – amoeba Nov 09 '14 at 21:55
  • @Anony-Mousse I see two distinct clusters (in 3D), and with standardization I see a single 'messy' cluster with clustering performance of about 50% (i.e., no clustering). – yoki Nov 09 '14 at 22:31
  • @amoeba My main purpose is to perform classification; clustering is a preliminary attempt to see what my data 'look like'. Since the data divide into clusters, I think classification should work as well. Or is my conclusion misguided? – yoki Nov 09 '14 at 22:32
  • Can you link the images, please? So we can see what you see? Is your PCA approach correct? (in particular, do you center your data?) – Has QUIT--Anony-Mousse Nov 10 '14 at 08:48
  • @Anony-Mousse I added an image. I do not center my data for PCA; when I do, I get some clustering, but with worse results. I know centering is important for PCA, but as far as I can see, if I see two clusters when using the original data, it may mean that they indeed belong to two clusters... – yoki Nov 10 '14 at 09:07
  • @Anony-Mousse I checked with centering, and it appears that with the correct kernel I get good clustering results as well. – yoki Nov 10 '14 at 10:03
  • The image, was this before PCA, or after? (After correct PCA; the data should have mean 0, variance 1, covariance 0 in any dimension; so this actually must be before PCA...) – Has QUIT--Anony-Mousse Nov 10 '14 at 18:26
  • It was after PCA, but I scaled it nonlinearly by a logarithm to better visualize the results. – yoki Nov 10 '14 at 19:01
  • If the picture above has log axes, the outliers probably determine your PCA. Is the log meaningful for your data (e.g., is it multiples of something)? In that case you may need to go one step back and think about your data representation first. – cbeleites unhappy with SX Nov 11 '14 at 09:51
  • Okay, thank you. I will try to see how I need to scale my original data. – yoki Nov 12 '14 at 12:29

1 Answer


Yes, there are situations where standardization (mean-centering and variance-scaling of all variates) can do harm.

See e.g.:
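To illustrate the mechanism with made-up numbers, here is a minimal numpy-only sketch (the data and dimensions are invented for the demonstration, not taken from the question): one high-variance feature separates two classes, while twenty low-variance features are pure noise. On the raw data, PC1 tracks the informative feature and the clusters are well separated; after standardizing every feature to unit variance, the noise features are blown up to the same scale, the leading principal direction no longer follows the separating axis, and the clusters smear together in the PC1 view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes that differ only along feature 0 (means -5 and +5, std 1).
# Features 1..20 are tiny noise (std 0.01), so feature 0 dominates the
# raw variance.
n = 100
X0 = np.column_stack([rng.normal(-5, 1, n)] + [rng.normal(0, 0.01, n) for _ in range(20)])
X1 = np.column_stack([rng.normal(+5, 1, n)] + [rng.normal(0, 0.01, n) for _ in range(20)])
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

def pc1_scores(X):
    """Project mean-centered data onto the first principal component (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def separation(scores, y):
    """Gap between class means on PC1, divided by the pooled within-class std."""
    a, b = scores[y == 0], scores[y == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

sep_raw = separation(pc1_scores(X), y)

# Standardize every feature to unit variance: the 20 noise features now
# carry as much variance as the informative one, so PC1 becomes an almost
# arbitrary direction and the class separation it shows collapses.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
sep_std = separation(pc1_scores(Z), y)

print(f"separation on PC1, raw data:          {sep_raw:.2f}")
print(f"separation on PC1, standardized data: {sep_std:.2f}")
```

The same point applies in the other direction, too: if the class-relevant signal lives in originally low-variance features, standardization can *help*. That is why the comment above says to think it over first rather than standardizing by default.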

cbeleites unhappy with SX
  • But it shouldn't harm for PCA IMHO. – Has QUIT--Anony-Mousse Nov 10 '14 at 18:24
  • @Anony-Mousse: I guess that depends on what you consider harming a PCA - a full PCA ultimately reconstructs the complete data set, so all information is still there. But it may cause the interesting information to appear only in higher PCs after meaningless variance that is caused by scaling up originally low (but correlated) signals. – cbeleites unhappy with SX Nov 26 '14 at 15:30