2

Consider the SVD of a centered data matrix:

$$ X_{centered} = U \Sigma V^T$$

where a column of $X_{centered}$ is:

$$ X_{centered} = x^{(i)} - \frac{1}{N} \sum^N_{n=1} x^{(n)} $$

is the matrix $ U $ the same as the matrix coeff that the function pca uses?

I could have sworn that they were the same until I wrote the following code:

clear;clc;
%% data
D = 3
N = 5
X = rand(D, N);
%X = magic(N); %% <------ uncomment this line for disaster
%% process data
x_mean = mean(X, 2); %% computes the mean of the data x_mean = sum(x^(i))
X_centered = X - repmat(x_mean, [1,N]);
%% PCA
[coeff, score, latent, ~, ~, mu] = pca(X'); % coeff =  U
[U, S, V] = svd(X_centered); % coeff = U
%% Reconstruct data
% if U = coeff then the following should be an identity I (since U is orthonormal)
U * U'
coeff * coeff'
% if U = coeff then they should be able to perfectly reconstruct the data
X_tilde_U = U * U'*X
X_tilde_coeff = coeff*coeff'*X

but then if one uncomments X = magic(N); and uses magic as the data matrix instead of random vectors, then we get different results from coeff and U. Meaning that either:

  1. They are not the same (i.e. either I have a misunderstanding that the left singular vectors of the centered data is not the principal components)

OR

  1. the matrix magic has some special properties that makes the pca in matlab be broken.
amoeba
  • 93,463
  • 28
  • 275
  • 317
Charlie Parker
  • 5,836
  • 11
  • 57
  • 113
  • 1
    Charlie, here is a hint: with `X=magic()` you have $N$ points in $N$ dimensions. This can only give you 4 PCs and that's what you get with `pca()`. But `svd()` returns 5 axes, 5th being arbitrary. Unless you use `svd()` with `'econ'` parameter. (CC to @usεr11852) – amoeba May 15 '16 at 19:12
  • 1
    @amoeba, I think you want to say using `svds` with `k` smaller than $N$. For a square matrix, the `econ` option won't change anything. Anyway to put this question to sleep: CharlieParker, check your `V`, `coeff` are the right singular vectors. – usεr11852 May 15 '16 at 19:34
  • 1
    @usεr11852: You are quite right, `econ` won't change anything in this case, I withdraw the last sentence of my previous comment. Thanks! But you are wrong about right singular vectors: again, X here is DxN, so PCA eigenvectors are *left* singular vectors. Specifically, `coeff` is 4 first columns of `U`. – amoeba May 15 '16 at 19:36
  • 1
    @amoeba... Bloody arbitrary transpositions, of course you are right. Servers me well not taking the time to read the question fully. Anyway, CP should know what is going on by now based on the comments. – usεr11852 May 15 '16 at 19:39
  • @amoeba sorry if this is trivial for you, but why can we only have 4 pca's? I thought it might had to do with the rank of magic(5) but it seems magic(5) has rank 5...not sure if I am thinking it in the wrong way... – Charlie Parker May 15 '16 at 19:55
  • CP, check the centred matrix... – usεr11852 May 15 '16 at 19:58
  • @usεr11852 yes, the left singular values U (except the last one) and coeff match. It seems they are the same. I am still puzzled though why the function svd computes the last arbitrary thing. Do you mind clarifying that to me? I've been reading svd's reference page and it has not been very enlightening. Thanks btw! :) – Charlie Parker May 15 '16 at 20:00
  • @usεr11852 ah....$X_{centered}$'s rank 4 not 5. probably has to do with the fact all columns,rows, diagonals add to the same number. – Charlie Parker May 15 '16 at 20:01
  • See here: http://stats.stackexchange.com/questions/123318. Does it clarify? – amoeba May 15 '16 at 20:15
  • @amoeba ah, so it seems that it has to do with the number of data points and the dimension of each data point. This confuses me a little bit, I thought it had to do with the fact that magic(N)'s rank decreased by one when it was centered... – Charlie Parker May 15 '16 at 20:36
  • 1
    @Charlie The rank of *any* matrix that has $D\ge N$ decreases by one when it is centered. This has nothing to do with your matrix being "magic" :-) – amoeba May 15 '16 at 20:40
  • 1
    @Charlie See my (updated) comment to user11852' answer. – amoeba May 15 '16 at 20:44

1 Answers1

3

Just to close this: What takes place is that the fifth axis is arbitrary given the dimensions of the sample ($5 \times 5$) generated by magic after that sample is centred (ie. the rank(X_centered) is 4). This is not due to the magic row-/col-sum property of magic squares but rather to their square nature. pca is intelligent enough to detect this rank-deficiency so it just returns four axes. If one replaces the line: [U, S, V] = svd(X_centered); with [U, S, V] = svds(X_centered,4); the same results will be obtained by both procedures.

In general the rank of magic square matrix is a funny thought experiment. Cleve Moler has wrote an excellent series of blog posts on the matter.

usεr11852
  • 33,608
  • 2
  • 75
  • 117
  • I'm still a bit confused, it seems that the issue is that the rank of magic decreases by 1 when its centered (which intuitively makes sense since they all add to the same number) but amoeba seems to imply that its an issue that the dimension of the data is $ D \geq N$ at least the number of data set points. Which one is it? – Charlie Parker May 15 '16 at 20:38
  • 1
    @Charlie, It's the same thing. If your data matrix has $D\ge N$ then its rank will decrease by 1 after centering. If, on the other hand, $D – amoeba May 15 '16 at 20:42