When to use PCA of features and when of samples?

Question

I am currently learning about applications of PCA and ZCA to the machine learning problems of classification and clustering. I would like to apply PCA and ZCA mostly, but not only, to image data. From what I understand, if we have a data matrix $X$ with dimensions $(n,m)$, where $n=$ number of features and $m=$ number of samples, then we can calculate the covariance matrix as $\Sigma_1 = XX^T$ if we want to reduce correlations of the features, and as $\Sigma_2=X^TX$ if we want to reduce correlations of the samples.

My question: is there a rule of thumb to check if in a given case it makes more sense to use $\Sigma_1$ or $\Sigma_2$?

I arrived at this question after finding that calculating the SVD of $\Sigma_1$, with $\dim=(n,n)$, is not possible on my computer if $n>4000$, which rules out colour images much larger than 32×32 pixels ($32 \cdot 32 \cdot 3$ colour channels $= 3072$ features). But then, if $m<n$, let's say $m\approx 1000$, I could calculate $\Sigma_2$ much more quickly than $\Sigma_1$. Additional questions: What caveats do you see in my idea? Is there an easy way to speed up the SVD of $\Sigma_1$ with some Python package?

score 2 · Answer 1 · answered Apr 19 '20 at 12:41

then we can calculate the covariance matrix as $\Sigma_1 = XX^T$

A quick note: this formula holds (up to a normalisation factor such as $1/(m-1)$) only for zero-centered data. That is, before calculating $\Sigma_1$, you need to have subtracted each feature's mean over the samples in your code: X = X - X_mean.
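As a minimal sketch of the centering step, assuming features are stored in rows and samples in columns as in the question, the snippet below builds $\Sigma_1$ from centered data and compares it with np.cov. The synthetic data, the variable names and the $1/(m-1)$ normalisation are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200                      # n features (rows), m samples (columns)
X = rng.normal(size=(n, m)) + 3.0  # synthetic data with a non-zero mean

# Subtract each feature's mean over the samples (axis=1, because samples are columns).
X_centered = X - X.mean(axis=1, keepdims=True)

# X X^T is the scatter matrix; dividing by (m - 1) gives the sample covariance.
Sigma_1 = X_centered @ X_centered.T / (m - 1)

# Without centering, the same formula gives a different matrix.
Sigma_1_uncentered = X @ X.T / (m - 1)

# np.cov treats rows as variables by default, so it matches this layout.
print(np.allclose(Sigma_1, np.cov(X)))
print(np.allclose(Sigma_1_uncentered, np.cov(X)))
```

The scaling factor does not change the eigenvectors, only the eigenvalues, so it can be left out when only the principal directions are needed.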

is there a rule of thumb to check if in a given case it makes more sense to use $\Sigma_1$ or $\Sigma_2$?

To answer your question: if decomposing $\Sigma_1$ is prohibitive for size/time reasons, you can decompose $\Sigma_2$ instead and use the result to calculate eigenvectors of $\Sigma_1$. This works because if $e$ is an eigenvector of $\Sigma_2$ with non-zero eigenvalue $\text{c}$, then $Xe$ is an eigenvector of $\Sigma_1$ with the same eigenvalue. The proof is below: $$ \begin{aligned} \Sigma_2 e &= \text{c}\,e \\ X^TXe &= \text{c}\,e \\ X(X^TXe) &= X\,\text{c}\,e \\ (XX^T)(Xe) &= \text{c}\,(Xe) \\ \Sigma_1(Xe) &= \text{c}\,(Xe) \end{aligned} $$

In these equations, $\text{c}$ is a constant (the eigenvalue). Using this trick, you can compute up to $m$ eigenvectors of $\Sigma_1$, one for each eigenvector of $\Sigma_2$ whose eigenvalue is non-zero. The proof is taken from this PDF, which also discusses other ways to compute principal components given memory issues. Pages 29 and 30 of this document specifically address your concern.
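The identity can be checked numerically. The following is a small sketch on synthetic data (the sizes, seed and variable names are illustrative assumptions): it takes the leading eigenvector $e$ of $\Sigma_2$ and verifies that $Xe$ satisfies the eigenvector equation for $\Sigma_1$ with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 8                            # more features than samples
X = rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)   # center each feature

Sigma_1 = X @ X.T                       # (n, n), the large matrix
Sigma_2 = X.T @ X                       # (m, m), the small matrix

# eigh is for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma_2)
c_top = eigvals[-1]                     # largest eigenvalue of Sigma_2
e = eigvecs[:, -1]                      # its unit-norm eigenvector

v = X @ e                               # candidate eigenvector of Sigma_1

# Sigma_1 v - c v should vanish up to floating-point error.
residual = np.linalg.norm(Sigma_1 @ v - c_top * v)
print(residual)

# v is not unit length: its squared norm equals the eigenvalue c.
print(np.linalg.norm(v) ** 2, c_top)
```

The last line shows why the mapped vectors have to be rescaled by $1/\sqrt{\text{c}}$ if unit-length eigenvectors are wanted.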

Thanks for the answer! I had also found this PDF previously; it is indeed helpful. My actual question was whether $\Sigma_1$ and $\Sigma_2$ have different applications, but after further reading it looks like they both lead to the same matrix $U_1$ from $U_1, S_1, V_1 = svd(\Sigma_1)$; it is just that if you start with $\Sigma_2$ instead of $\Sigma_1$ in some cases ($m – NeStack, Apr 22 '20 at 14:54

score 1 · Answer 2 · answered Apr 19 '20 at 12:51


You can also look at iterative Singular Value Decomposition algorithms to do PCA on large matrices, instead of an eigendecomposition of either $\Sigma_2$ or $\Sigma_1$.
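To illustrate what an iterative approach can look like, here is a minimal sketch of subspace (block power) iteration in plain NumPy. It is one simple member of this family, not necessarily the algorithm a given library uses, and the function name top_k_left_singular_vectors, its parameters and the synthetic test matrix are all hypothetical. It approximates the $k$ leading left singular vectors of $X$ without ever forming $XX^T$.

```python
import numpy as np

def top_k_left_singular_vectors(X, k, n_iter=20, seed=0):
    """Approximate the k leading left singular vectors of X by subspace iteration."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Start from a random orthonormal basis of a k-dimensional subspace.
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    for _ in range(n_iter):
        # Apply X X^T as two thin products; the (n, n) matrix is never built.
        Z = X @ (X.T @ Q)
        # Re-orthonormalise to keep the iteration numerically stable.
        Q, _ = np.linalg.qr(Z)
    # Rayleigh-Ritz step: a small SVD inside the subspace orders the vectors.
    B = Q.T @ X
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small, s

# Synthetic matrix with three dominant directions plus small noise
# (a stand-in for already-centered data).
rng = np.random.default_rng(3)
n, m, k = 300, 40, 3
U0, _ = np.linalg.qr(rng.normal(size=(n, k)))
V0, _ = np.linalg.qr(rng.normal(size=(m, k)))
X = U0 @ np.diag([20.0, 10.0, 5.0]) @ V0.T + 1e-3 * rng.normal(size=(n, m))

U_approx, s_approx = top_k_left_singular_vectors(X, k)

# Reference: full thin SVD; compare columns up to sign.
U_ref, s_ref, _ = np.linalg.svd(X, full_matrices=False)
overlap = np.abs(np.sum(U_ref[:, :k] * U_approx, axis=0))
print(s_approx, s_ref[:k])
print(overlap)
```

Each iteration costs two products with $X$, so memory stays at the size of $X$ itself. Convergence is fast here because the synthetic spectrum has a large gap after the third singular value; real data with a slowly decaying spectrum may need more iterations or a few extra columns in Q.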

Regarding your ZCA vs PCA question, it has already been answered here: What is the difference between ZCA whitening and PCA whitening?


vzografos


Thanks for the answer. I had a look at iterative Singular Value Decomposition algorithms, and it looks like they can indeed increase the computational speed. Since you are a new user, I recommend that in the future you add examples, additional information, reasons why these algorithms would be good, etc., to your answer. Otherwise it is not very informative. Furthermore, the difference between ZCA and PCA wasn't my question. – NeStack Apr 22 '20 at 14:47

score 0 · Answer 3 · answered Apr 22 '20 at 15:53

Answer to my own question:

After further reading, I couldn't find any application for $\Sigma_2$ other than arriving more quickly at the same results as with $\Sigma_1$, and only in cases where $m<n$. So it was a mistake of mine to think that $\Sigma_2$ represents the "correlations of the samples"; it is just a possible shortcut to the "correlations of the features". If you know of another application of $\Sigma_2$, let me know!

Below I will show how to use $\Sigma_2$. For PCA one wants to find the representation of $X$ in a new coordinate space spanned by the vectors $u_{1i}$, the column vectors of the matrix $U_1$ obtained from the singular value decomposition:

$U_1, S_1, V_1 = svd(\Sigma_1)$

If, for the data matrix $X$ with $dim(X)=(n,m)$, it holds that $m<n$, then it might be quicker to obtain $U_1$ by doing the following:

$U_2, S_2, V_2 = svd(\Sigma_2)$

$U_{2}^{*} = X U_2$

You then need to divide each column $u^{*}_i$ of $U_{2}^{*}$ by the square root of the corresponding element $s_i$ in $S_2$, because $\|X u_i\|^2 = u_i^T X^T X u_i = s_i$, so the columns of $U_{2}^{*}$ are not yet unit length (in Python's NumPy this is simply U_hat2 = U_star2/np.sqrt(S2)):

$\hat{U}_{2} = [\frac{1}{\sqrt{s_{1}}} \cdot u^{*}_1 , \frac{1}{\sqrt{s_{2}}} \cdot u^{*}_2, ..., \frac{1}{\sqrt{s_{m}}} \cdot u^{*}_m] $

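$\hat{U}_{2}$ has dimensions $(n,m)$, while $U_1$ has dimensions $(n,n)$. The columns of $\hat{U}_{2}$ equal the first $m$ columns of $U_1$, up to the sign of each column and provided the corresponding $s_i$ are distinct and non-zero (for centered data the last $s_i$ is numerically zero, so that column should be dropped). In Python, for the first r columns with non-zero $s_i$, this can be checked with np.allclose(np.abs(U_1[:,:r]), np.abs(U_hat2[:,:r])).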
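The whole shortcut can be tried end to end on synthetic data. The sketch below makes a few assumptions that are not in the text above: random Gaussian data, a relative threshold of 1e-10 for discarding near-zero singular values, and a comparison of columns up to sign, since singular vectors are only defined up to a factor of $\pm 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 60, 10                              # n features, m samples, m < n
X = rng.normal(size=(n, m))
X = X - X.mean(axis=1, keepdims=True)      # centering leaves rank at most m - 1

# Direct route: SVD of the large (n, n) matrix.
U1, S1, _ = np.linalg.svd(X @ X.T)

# Shortcut: SVD of the small (m, m) matrix.
U2, S2, _ = np.linalg.svd(X.T @ X)

# Keep only components whose singular value is not numerically zero,
# otherwise the division below would blow up rounding noise.
keep = S2 > 1e-10 * S2[0]
r = int(keep.sum())

# Map to feature space and rescale each column to unit length.
U_star2 = X @ U2[:, keep]
U_hat2 = U_star2 / np.sqrt(S2[keep])

# Columns agree with the leading columns of U1 up to sign,
# so the absolute inner product of matching columns should be 1.
overlap = np.abs(np.sum(U1[:, :r] * U_hat2, axis=0))
print(r, overlap)
```

With centered data the number of usable components is at most $m-1$, which is why one column is discarded here.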

This is how you can arrive at $U_1$ in two ways, the second of which saves time when $m<n$.
