
I have some data that I want to classify. As an initial step, I ran PCA on the data and saw two distinct clusters. However, when I standardize the data first, the two clusters disappear. What can this mean - that the classes are easily separated by the individual variances or means of the features? If that is the case, how should I approach classification?

Thanks.

Edit: Following the commenters' requests, I am adding an image of my clustered data. Since the PCA projection has more than 3 dimensions, it is hard to see from this image alone why classification succeeds. Also, the colors show the TRUE labels, not estimated ones.

yoki
  • Both PCA results and clustering (K-means, in particular) results are sensitive to standardization of variables. Whether you do PCA/clustering in conjunction or independently, you should first think it over - to standardize or not - and then go ahead with your decision. – ttnphns Nov 09 '14 at 00:19
  • Can you share a *visualization* of the effects you have been observing? Solving your problem from just two sentences of description is hard; it would really benefit from a visualization. – Has QUIT--Anony-Mousse Nov 09 '14 at 15:32
  • Also, "classification" is not "clustering". The first is supervised, the second is un-supervised. Please clarify what you are actually doing. – amoeba Nov 09 '14 at 21:55
  • @Anony-Mousse I see two distinct clusters (in 3D), and with standardization I see a single 'messy' cluster with clustering performance of about 50% (i.e., no clustering). – yoki Nov 09 '14 at 22:31
  • @amoeba My main purpose is to perform classification; clustering is a preliminary attempt to see what my data 'look like'. Since the data divide into clusters, I think classification should work as well. Or is my conclusion misguided? – yoki Nov 09 '14 at 22:32
  • Can you link the images, please? So we can see what you see? Is your PCA approach correct? (in particular, do you center your data?) – Has QUIT--Anony-Mousse Nov 10 '14 at 08:48
  • @Anony-Mousse I added an image. I do not center my data for PCA; when I do, I get some clustering, but with worse results. I know centering is important for PCA, but as far as I can see, if I see two clusters when using the original data, it may mean that they indeed belong to two clusters... – yoki Nov 10 '14 at 09:07
  • @Anony-Mousse I checked with centering, and it appears that with the correct kernel I get good clustering results as well. – yoki Nov 10 '14 at 10:03
  • The image, was this before PCA, or after? (After correct PCA; the data should have mean 0, variance 1, covariance 0 in any dimension; so this actually must be before PCA...) – Has QUIT--Anony-Mousse Nov 10 '14 at 18:26
  • It was after PCA, but I scaled it nonlinearly by a logarithm to better visualize the results. – yoki Nov 10 '14 at 19:01
  • If the picture above has log axes, the outliers probably determine your PCA. Is the log meaningful for your data (e.g., is it multiples of something)? In that case you may need to go one step back and think about your data representation first. – cbeleites unhappy with SX Nov 11 '14 at 09:51
  • Okay, thank you. I will try to see how I need to scale my original data. – yoki Nov 12 '14 at 12:29

1 Answer


Yes, there are situations where standardization (mean-centering and variance-scaling of all variates) can do harm.

See e.g.:
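To illustrate the mechanism with made-up numbers, here is a minimal numpy-only sketch (the data and dimensions are invented for the demonstration, not taken from the question): one high-variance feature separates two classes, while twenty low-variance features are pure noise. On the raw data, PC1 tracks the informative feature and the clusters are well separated; after standardizing every feature to unit variance, the noise features are blown up to the same scale, the leading principal direction no longer follows the separating axis, and the clusters smear together in the PC1 view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes that differ only along feature 0 (means -5 and +5, std 1).
# Features 1..20 are tiny noise (std 0.01), so feature 0 dominates the
# raw variance.
n = 100
X0 = np.column_stack([rng.normal(-5, 1, n)] + [rng.normal(0, 0.01, n) for _ in range(20)])
X1 = np.column_stack([rng.normal(+5, 1, n)] + [rng.normal(0, 0.01, n) for _ in range(20)])
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

def pc1_scores(X):
    """Project mean-centered data onto the first principal component (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def separation(scores, y):
    """Gap between class means on PC1, divided by the pooled within-class std."""
    a, b = scores[y == 0], scores[y == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

sep_raw = separation(pc1_scores(X), y)

# Standardize every feature to unit variance: the 20 noise features now
# carry as much variance as the informative one, so PC1 becomes an almost
# arbitrary direction and the class separation it shows collapses.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
sep_std = separation(pc1_scores(Z), y)

print(f"separation on PC1, raw data:          {sep_raw:.2f}")
print(f"separation on PC1, standardized data: {sep_std:.2f}")
```

The same point applies in the other direction, too: if the class-relevant signal lives in originally low-variance features, standardization can *help*. That is why the comment above says to think it over first rather than standardizing by default.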

cbeleites unhappy with SX
  • But it shouldn't harm for PCA IMHO. – Has QUIT--Anony-Mousse Nov 10 '14 at 18:24
  • @Anony-Mousse: I guess that depends on what you consider harming a PCA - a full PCA ultimately reconstructs the complete data set, so all information is still there. But it may cause the interesting information to appear only in higher PCs after meaningless variance that is caused by scaling up originally low (but correlated) signals. – cbeleites unhappy with SX Nov 26 '14 at 15:30