Consider a feature matrix $X := [x_1 \, x_2 \, \dots \, x_d] \in \mathbb{R}^{n \times d}$ that I hope to use as part of some supervised learning procedure, say, regression. Suppose also that $d \gg n$, where $d$ is the number of features and $n$ is the number of data points. I want to somehow reduce the number of features I'm using for my model while preserving as much information as I can.
I can think of many ways to do this, but would particularly like to ask about the difference between using PCA and using spectral clustering with a linear kernel. PCA can be interpreted as finding the directions (principal vectors) that best explain the variation in the data (i.e., have the highest variance) and is traditionally computed via the SVD. Using PCA, I can "compress" my features to a much smaller number while losing as little as possible of the variation the original features capture.
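To make that concrete, here is a minimal sketch of what I mean by compressing the features via PCA/SVD. The dimensions, the random data, and the choice of $k$ components are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 500, 10            # n samples, d features (d >> n), k retained components (illustrative)

X = rng.standard_normal((n, d))
X -= X.mean(axis=0)              # center the columns (my real data are already normalized)

# PCA via thin SVD: the rows of Vt are the principal directions in feature space
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                 # n x k score matrix: the "compressed" features I would regress on
```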
Spectral clustering with the linear kernel $\mathcal{K}(x,y) := \langle x, y \rangle = x^\top y$, on the other hand, essentially maps the data to a space where the distance metric (I use this term very loosely here) is the correlation or inner product between the vectors. We then project onto the top eigenvectors and run Lloyd's algorithm (K-means) or some variant. The clusters, in this case, will be sets of vectors that are highly correlated, while the cluster centroids represent these different groups of vectors.
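Here is roughly the procedure I have in mind, continuing from the PCA sketch above. It is only a rough sketch of clustering the *features* (columns) with a linear-kernel affinity: no Laplacian normalization, no handling of negative affinities, and the numbers of eigenvectors and clusters are arbitrary:

```python
from sklearn.cluster import KMeans

K_lin = X.T @ X                            # d x d linear-kernel (Gram) matrix between features
evals, evecs = np.linalg.eigh(K_lin)       # eigenvalues in ascending order
embed = evecs[:, -10:]                     # embed each feature by its coordinates on the top 10 eigenvectors

k_clusters = 10
labels = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit_predict(embed)

# One "centroid-like" summary per cluster: the mean of the original feature
# vectors (in R^n) assigned to that cluster
centroids = np.stack([X[:, labels == c].mean(axis=1) for c in range(k_clusters)])
```

The reduced design matrix for the regression would then be `centroids.T` (an $n \times k_{\text{clusters}}$ matrix), in place of the PCA scores `Z`.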
Somehow, the top principal vectors and these cluster centroids strike me as very similar, yet I feel I may be missing something. What is the difference between a small set of vectors that explain the most variation in the data (the top principal vectors) and a set of vectors that summarize or represent a cluster of highly correlated features (the spectral clustering centroids)? How would this translate to my overarching regression problem?
Edit: Please assume the data I'm interested in are normalized.