Consider a feature matrix $X := [x_1 \, x_2 \, \dots \, x_d] \in \mathbb{R}^{n \times d}$ that I hope to use as part of some supervised learning procedure, say, regression. Suppose also that $d \gg n$, where $d$ is the number of features and $n$ is the number of data points. I want to somehow reduce the number of features I'm using for my model while preserving as much information as I can.
I can think of many ways to do this, but would particularly like to ask about the difference between using PCA and using spectral clustering with a linear kernel. PCA can be interpreted as finding the directions (principal vectors) that best explain the variation in the data (i.e., have the highest variance) and is traditionally computed via the SVD. Using PCA, I can "compress" my features to a much smaller number while losing as little as possible of the variation the original features capture.
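To make that concrete, here is a minimal sketch of what I mean by compressing the features via PCA/SVD. The dimensions, the random data, and the choice of $k$ components are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 500, 10            # n samples, d features (d >> n), k retained components (illustrative)

X = rng.standard_normal((n, d))
X -= X.mean(axis=0)              # center the columns (my real data are already normalized)

# PCA via thin SVD: the rows of Vt are the principal directions in feature space
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                 # n x k score matrix: the "compressed" features I would regress on
```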
Spectral clustering with the linear kernel $\mathcal{K}(x,y) := \langle x, y \rangle = x^\top y$, on the other hand, essentially maps the data to a space where the distance metric (I use this term very loosely here) is the correlation or inner product between the vectors. We then project onto the top eigenvectors and run Lloyd's algorithm (K-means) or some variant. The clusters, in this case, will be sets of vectors that are highly correlated, while the cluster centroids represent these different groups of vectors.
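Here is roughly the procedure I have in mind, continuing from the PCA sketch above. It is only a rough sketch of clustering the *features* (columns) with a linear-kernel affinity: no Laplacian normalization, no handling of negative affinities, and the numbers of eigenvectors and clusters are arbitrary:

```python
from sklearn.cluster import KMeans

K_lin = X.T @ X                            # d x d linear-kernel (Gram) matrix between features
evals, evecs = np.linalg.eigh(K_lin)       # eigenvalues in ascending order
embed = evecs[:, -10:]                     # embed each feature by its coordinates on the top 10 eigenvectors

k_clusters = 10
labels = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit_predict(embed)

# One "centroid-like" summary per cluster: the mean of the original feature
# vectors (in R^n) assigned to that cluster
centroids = np.stack([X[:, labels == c].mean(axis=1) for c in range(k_clusters)])
```

The reduced design matrix for the regression would then be `centroids.T` (an $n \times k_{\text{clusters}}$ matrix), in place of the PCA scores `Z`.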
Somehow, the top principal vectors and these cluster centroids strike me as very similar, yet I feel I may be missing something. What is the difference between a small set of vectors that explain the most variation in the data (the top principal vectors) and a set of vectors that summarize or represent a cluster of highly correlated features (the spectral clustering centroids)? How would this translate to my overarching regression problem?
Edit: Please assume the data I'm interested in are normalized.