Kmeans clustering results on pca dataset reduction

Question

I have a set of 321 observations of 18 correlated variables, so I use PCA to extract a low-dimensional set of features from this high-dimensional data set. I select 9 of the 18 components (the number of components that explains 80% of the total variance). After determining the number of clusters with NbClust, I apply k-means clustering to do the classification.

I am using PCA for dimensionality reduction in order to reduce the complexity of my problem, giving an interpretation to all the components.

My question: Why are the clusters differentiated only in planes that include PC1 (for example the PC1-PC2 plane, the PC1-PC3 plane, etc.)? How can I solve this problem?

You didn't say why this is a "problem" that you need to solve. Did you expect something different? What did you expect, then, and why? The issue in your question may be not just about PCA but about how [K-means behaves when dimensions have different variances](http://stats.stackexchange.com/q/21222/3277). — ttnphns, Dec 13 '16 at 15:58

score 0 · Answer 1 · answered Dec 13 '16 at 15:36

It might be helpful to look at a plot of the percentage of variance explained against component number. If PC1 explains a lot more of the variance than any other component, PC1 may be the component that matters most for clustering your data.

Also, PC1 may contain the important information that is differentiating your groups. PCA is looking at variance in general, so higher components are possibly capturing smaller deviations (such as those within a group).
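As a rough illustration of that check, here is a minimal Python sketch (NumPy only) that computes the proportion of variance explained by each component and the number of components needed to reach 80%. The data are synthetic: 321 observations of 18 correlated variables generated from three made-up latent factors plus noise. That structure, the noise level and all variable names are assumptions for illustration, not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 321 observations of
# 18 variables that are correlated because they share 3 latent factors.
n_obs, n_vars = 321, 18
latent = rng.normal(size=(n_obs, 3))
loadings = rng.normal(size=(3, n_vars))
X = latent @ loadings + 0.5 * rng.normal(size=(n_obs, n_vars))

# Standardize each variable, then do PCA through the SVD.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)

# Variance along each component and its share of the total.
explained_variance = s**2 / (n_obs - 1)
explained_ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 80%.
n_keep = int(np.searchsorted(cumulative, 0.80) + 1)

for i, (r, c) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{i:2d}: {r:6.1%}  cumulative {c:6.1%}")
print("components needed for 80%:", n_keep)
```

Plotting `explained_ratio` against the component number gives the plot described above. A steep drop after PC1 would be consistent with PC1 carrying most of the separation between groups.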

score 0 · Answer 2 · answered Sep 27 '21 at 13:16

Components are ordered according to how much variability your data display along each of them. So points at opposite ends of the first component are farther from each other than points at opposite ends of any other component. K-means works by looking at distances between points. When two points are at opposite ends of the PC1 projection, the distance between them is a lot bigger than when they are at opposite ends of, say, the PC8 projection. This is not a problem.
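To make the distance argument concrete, here is a small Python sketch (NumPy only). It uses hypothetical PC scores, not real data: 321 points on 9 uncorrelated components with made-up, decreasing standard deviations. It shows that the average squared difference between two points along a component is exactly twice that component's variance, so PC1 contributes the largest share of the squared Euclidean distance that K-means works with.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical PC scores (assumption): 9 uncorrelated components whose
# standard deviations decrease, as PCA scores do. The values are made up.
sds = np.array([3.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
n_obs = 321
scores = rng.normal(size=(n_obs, sds.size)) * sds

# Difference between every pair of points, component by component.
diffs = scores[:, None, :] - scores[None, :, :]   # shape (321, 321, 9)
rows, cols = np.triu_indices(n_obs, k=1)          # each pair counted once
mean_sq_diff = (diffs[rows, cols] ** 2).mean(axis=0)

# The mean squared pairwise difference equals 2 * sample variance.
variance = scores.var(axis=0, ddof=1)
print(np.allclose(mean_sq_diff, 2 * variance))

# Share of the average squared distance contributed by each component.
share = mean_sq_diff / mean_sq_diff.sum()
for i, p in enumerate(share, start=1):
    print(f"PC{i}: {p:5.1%} of the mean squared distance")
```

Because squared Euclidean distance is the sum of these per-component terms, the component with the largest variance dominates it, and clusters found by K-means will tend to separate along that component first.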
