
The rotation matrix output by the PCA algorithm should be independent of something trivial like the column ordering of the source data. Can anyone explain why my output doesn't match that expectation?

I made a 569x30 test input file from a pre-made dataset:

import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer.keys()

df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df.to_csv(r'input file', index=False)

Then I generated a 30x30 output with all the covariance-based PCA components:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

daily_series = pd.read_csv(r'input path')

sd = daily_series[daily_series.columns[0:daily_series.shape[1]]]  # all columns
scaled_data = sd  # unscaled
pca = PCA(n_components=daily_series.shape[1])
pca_model = pca.fit(scaled_data)
components = [f'PC{i}' for i in range(1, 31)]
variables = daily_series.columns[0:daily_series.shape[1]]
Matrix = pd.DataFrame(pca_model.components_, columns=components, index=variables)

Matrix.to_csv(r'output path', index=True)

When I reorder the columns of the test input file (say, alphabetically) and run the above, the output differs from the original not just in the signs but also in the magnitudes. I don't understand how that's possible.
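For reference, the reordering step is just something like this (a minimal sketch; the paths are the same placeholders as above, and the reordered file name is only illustrative):

import pandas as pd

daily_series = pd.read_csv(r'input path')
# sort the columns alphabetically by name and save a reordered copy
reordered = daily_series[sorted(daily_series.columns)]
reordered.to_csv(r'reordered input path', index=False)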

Output (left: original output; right: output after alphabetizing columns in the source data): [screenshot]

Quesop
  • The main thing to observe is that the SVD of $A$ is given by $$ A = U\Sigma V^\top, $$ so for a column permutation via matrix $P$ we have $$ AP = U\Sigma V^\top P, $$ which is still a valid SVD because permutation matrices are orthogonal and products of orthogonal matrices are orthogonal. Your components are given by $V^\top P$. The rest is just showing the relationship between PCA and SVD. This is covered thoroughly in https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca – Sycorax Jan 30 '20 at 00:27
  • @SycoraxsaysReinstateMonica The transformed variables (principal components) should correspond to the original variables. So why don't they, as I highlighted in red? I'm trying to understand why the Python PCA implementation doesn't work as it should. I shouldn't have to write my own implementation, right? – Quesop Jan 30 '20 at 04:36
  • The transformed variables *do* correspond to the original variables. The reason the two outputs don't match is that one is a permutation of the other. This is what I wrote in my first comment. The python implementation works correctly because this is a property of SVD and orthogonal matrices. – Sycorax Jan 30 '20 at 04:41
  • @SycoraxsaysReinstateMonica I know what a permutation is and I know what a column is, but I don't understand what you mean by "column permutation." Permutation means the order matters. Manifestly the order of the columns matters, but why? Column order is trivial, so it should not matter. How are the left and the right screenshots equivalent representations? – Quesop Jan 30 '20 at 04:47
  • They're not *equivalent,* they're *permutations.* You can rearrange the columns in the two screenshots, and after rearranging, they'll be the same. The order that you use for rearranging is given by $P$. In the permuted case, `pca_model.components_` has computed $V^\top P$. Because permutation matrices have the property that $P^{-1}=P^\top$, you can retrieve the original $V^\top$ by observing $V^\top P P^\top = V^\top$. – Sycorax Jan 30 '20 at 05:02
  • @SycoraxsaysReinstateMonica I looked at the output files again and I see you were right. PC1, PC2, PC3 became PC12, PC15, PC11 respectively when I alphabetized columns in my example. OK, this means the relationship between variables after you reduce dimensions is sensitive to the ordering of the columns in the beginning. How do you pick the "best" order of columns in the beginning if your goal is to understand how the variables are interrelated? – Quesop Jan 30 '20 at 06:12
  • This seems like a fundamentally different question, distinct from the one in your post. If you've found my answer addresses your original post, please consider upvoting and/or accepting. If you have additional questions, you can click Ask Question (if you feel you need this question for context, you can link to it). – Sycorax Jan 30 '20 at 06:21
  • Hey Quesop. I talked to @Sycorax in the comments to their answer, and we figured out that your confusion was because your PCs are in *rows* of your Excel table, and your variables are in *columns*. You somehow have these tables transposed. PC1 cannot become PC12 after changing the column order; this just does not make any sense. PC1 stays PC1, but its elements are reordered. – amoeba Feb 06 '20 at 10:18

1 Answer


We have thoroughly developed the relationship between SVD and PCA in "Relationship between SVD and PCA. How to use SVD to perform PCA?", which is worth reviewing if you're uncertain about the connection.


The sklearn PCA implementation is working correctly.

The main thing to observe is that the SVD of $A$ is given by $$ A=U S V^\top $$ so for a permutation of columns via matrix $P$ we have $$ AP=U S V^\top P. $$

Another way to state this is that if you compute the SVD of $AP$, you'll end up with $AP = U S \tilde{V}^\top$, where $\tilde{V}^\top = V^\top P$.

We know that $\tilde{V}^\top=V^\top P$ is orthogonal because permutation matrices are orthogonal and products of orthogonal matrices are orthogonal.

Your screenshots show different things because you're comparing $V^\top$ and $V^\top P$, which are not equal in general. In fact, $V^\top$ and $V^\top P$ are only guaranteed to be equal if $P=I$. Column order matters just for $V$; $U$ and $S$ are the same.


We can even show that the permuted data, rotated by its own (permuted) basis, yields the same result.

$$ \begin{aligned} AV &= USV^\top V \\ AV &= US \end{aligned} $$

And we can show the same result for $AP$ because a permutation matrix $P$ is orthogonal.

$$ \begin{aligned} AP P^\top V &= USV^\top P P^\top V \\ AV &= US \end{aligned} $$

In other words, the column order doesn't matter for creating a linearly independent basis for $A$, because you obtain the same result for $AP$ and $A$.
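Before the fuller demonstration below, here is a quick numerical check of these two identities, a minimal sketch using the same breast cancer data as in the question (the variable names are just illustrative):

import numpy as np
from numpy.linalg import svd
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data            # plays the role of A
U, S, Vt = svd(X, full_matrices=False)    # note: numpy returns V transposed

P = np.eye(X.shape[1])
np.random.shuffle(P)                      # a random column permutation

# A V = U S
assert np.allclose(X @ Vt.T, U * S)
# (A P)(P^T V) = U S, because P P^T = I
assert np.allclose((X @ P) @ (P.T @ Vt.T), U * S)
# and the singular values of A and AP coincide
assert np.allclose(S, svd(X @ P, compute_uv=False))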


We can demonstrate this all in Python.

import numpy as np
from numpy.linalg import svd
from numpy.random import shuffle
from sklearn.datasets import load_breast_cancer

if __name__ == "__main__":
  X, y = load_breast_cancer(return_X_y=True)
  U, S, V = svd(X, full_matrices=False)  # NB: numpy's svd returns V transposed, so this V holds the V^T above

  P = np.eye(X.shape[1])
  shuffle(P)

  print("X and X @ P are not the same.")
  print(X @ P - X)

  # This will work correctly because both X and the SVD of X are permuted.
  assert np.allclose(U @ np.diag(S) @ V @ P - X @ P, 0.0)

  try:
    # This will fail because X is permuted but the SVD is ~not~.
    assert np.allclose(U @ np.diag(S) @ V - X @ P, 0.0)
  except AssertionError:
    print("V @ P != V")
    print(V @ P - V)

You can replace P with any permutation matrix you desire, even one which alphabetizes the column names.
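For instance, here is a minimal sketch of building the permutation matrix that alphabetizes the breast cancer feature names (again, the variable names are just illustrative):

import numpy as np
from numpy.linalg import svd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, names = data.data, data.feature_names

# column j of P picks out the j-th feature in alphabetical order
order = np.argsort(names)
P = np.eye(X.shape[1])[:, order]

U, S, Vt = svd(X, full_matrices=False)
# the SVD of the alphabetized data reuses U and S; only V^T is permuted
assert np.allclose(U @ np.diag(S) @ (Vt @ P), X @ P)
# and the original V^T is recovered via (V^T P) P^T
assert np.allclose((Vt @ P) @ P.T, Vt)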

Sycorax
  • Actually I am still confused by this Q. Shouldn't `PCA` from `sklearn` order components by explained variance? If so, permuting the features in X should not affect the returned V, because its columns are supposed to be sorted. What am I missing? – amoeba Feb 04 '20 at 08:54
  • They *are* ordered by explained variance. The $U$ and $S$ in my answer are the same for both $AP$ and $A$. The difference is that the $P$ re-arranges which directions maximize variance, kind of like how re-labeling "North" on a compass doesn't change where the North Pole is. Another way to think about it is that the covariance of $AP$ is given by $\frac{1}{n-1} P^\top A^\top AP = P^\top V\frac{S^2}{n-1} V^\top P$, which is a full permutation of $A$'s covariance $V \frac{S^2}{n-1} V^\top$. Clearly the factors $S, V$ are just re-arranged by $P$, but their values aren't changed. – Sycorax Feb 04 '20 at 14:21
  • Another way to answer your question is to point out that it's just convention to sort an SVD factorization according to the values of $S$. However, the definition of a singular value and singular vector (and likewise eigenvalue and eigenvector) is still satisfied as long as you match singular values in $S$ to their singular vectors in $U,V$. So for whatever ordering of $S$ you choose, all of the equalities in my answer will still hold, as long as you order $U,V$ to match the ordering in $S$. – Sycorax Feb 04 '20 at 14:27
  • Sorry but this does not clarify :-( Maybe I am misunderstanding what happens in the question? The breast cancer dataset has 30 features and 500+ samples. OP does PCA on 30x30 cov matrix using sklearn and gets 30 eigenvectors. They should be sorted by the eigenvalues. Now the OP permutes the 30 features and re-runs PCA. The eigenvectors are not the same even accounting for the re-ordering: e.g. the element for "area error" in the 1st eigenvector is 0.08 vs. 0.05. In the comments OP says that PC1 became PC12. This does not make sense to me. – amoeba Feb 04 '20 at 22:41
  • Suppose $A$ has rank 2 and 2 columns and you swap the two columns. Do you expect $V$ to be the same or to be permuted? Why or why not? Can you relate your intuition to the definitions of eigenvalues and eigenvectors? Drawing a picture should make this more clear but I’m on a train now so I can’t. – Sycorax Feb 05 '20 at 01:31
  • So for concreteness, if A has measurements of some objects and its columns are Width and Height, and I do PCA and get PC1 which has let's say [0.6, 0.8] weights (so that's the eigenvector with the largest eigenvalue), then if I swap the columns to [Height, Width], then I expect the resulting PC1 to have [0.8, 0.6] weights. The same weights as before but permuted of course, such that the PC1 weight for Height remains the same as it was (0.8). Surely you cannot argue with that??? But this is NOT what I see in the tables that OP posted! [exploding head emoji] – amoeba Feb 05 '20 at 11:59
  • @amoebasaysReinstateMonica In a comment, OP says "PC1, PC2, PC3 became PC12, PC15, PC11 respectively when I alphabetized columns in my example." But the tables only have the first 3 PCs. So the reason the table doesn't show the same PCs in a different order is because the table is only showing the first 3 PCs, which does not extend to PC 11, 12, ... 15. – Sycorax Feb 05 '20 at 14:40
  • I think we are talking past each other :) What I tried to show in my concrete example in the comment above with Height/Width, is that if PCs are ordered by variance then PC1 should remain the same (up to reordering of its entries). So [0.6, 0.8] will become [0.8, 0.6]. Do you agree with that? If you do agree, then "PC1 becoming PC12" in OPs comment does not make sense. – amoeba Feb 05 '20 at 23:06
  • $V^\top P$ is a permutation of the columns of $V^\top$, so what happens in your example is that PC1 is swapped with PC2, not that the elements of PC1 are swapped. (The only non-$I$ permutation matrix for $2\times 2$ covariance matrix is $P=[e_2 ~~ e_1]$). – Sycorax Feb 05 '20 at 23:09
  • But this does not make any physical sense! Just imagine a diagonally elongated 2D scatter plot with the first eigenvector pointing in the [0.8, 0.6] direction. Now swap X and Y. The scatter plot is still diagonally elongated, with the first eigenvector pointing in the [0.6, 0.8] direction. It is *NOT* pointing in the direction orthogonal to [0.8,0.6]. In terms of algebra and using your notation, PC1 is the first row of $V^\top$. Your $P$ permutes columns of $V^\top$ so it permutes elements of PC1, and not the PC1 with PC2. – amoeba Feb 05 '20 at 23:22
  • You should probably ask a question about this. I feel like we're talking in circles. Maybe the screenshots are transposed? Maybe the terminology is confused? I'm confident that the algebra works out. I'm less interested in an argument about what a row is. – Sycorax Feb 05 '20 at 23:33
  • Well I'm saying that your statement (from the comment above) "what happens in your example is that PC1 is swapped with PC2, not that the elements of PC1 are swapped" is wrong. And I explained why. Not sure what your response to that is; do you agree with me? If you don't agree, what specifically in the last sentence of my previous comment do you think is wrong? – amoeba Feb 05 '20 at 23:39
  • I think all you're saying is that OP's table should be transposed? Or that what I'm calling a PC is a misuse of terminology? Like I said, the terminology discussion is not interesting to me... Just work out the algebra, and you can show that it works, and you don't end up tripping over what someone decided to label a column or a row. – Sycorax Feb 05 '20 at 23:43
  • Re "I think all you're saying is that OP's table should be transposed? " -- to be honest I didn't think of that, but now that you say it, it does seem like a likely explanation of what happened there :) – amoeba Feb 05 '20 at 23:45
  • I am sorry, but I cannot agree that it does not matter what the word "PC" refers to here. If you ask sklearn PCA to give you only PC1 with `n_components=1`, you really expect matrix V that you get out to have one column. So that is PC1. Not a row. – amoeba Feb 05 '20 at 23:48
  • That said, we do seem to be in full agreement as long as the math goes :) – amoeba Feb 05 '20 at 23:49
  • @amoeba I see what you're saying, but I've found that reliance on terminology can confuse the issue, especially because most software documentation is garbage. Instead, I prefer to think through the factorization. So in your example, if I want the product $AV$ to have rank $k$, I just think through the algebra: what shapes of $U,S,V$ do I need for the (truncated) SVD to make sense and produce the desired result? Maybe it's just a personal quirk, but this is the only way I can remember how PCA and SVD work. What's a "score," "loading," or "PC"? I honestly can never remember, but I know the math – Sycorax Feb 05 '20 at 23:52
  • +1 to this, I think this is a great way to think about it. And I agree that some traditional terminology here (score/loading/etc) is confusing. Nevertheless, let me suggest that we reserve the word "PC1" for whatever comes out of a rank-1 approximation! This should be easy for you to remember because it clearly follows the mental picture suggested in your last comment. As we figured out, the confusion that OP had was because his PCs are actually in rows of his Excel tables and his variables are in the columns; I doubt that your answer resolved their confusion (but who knows, maybe it did). – amoeba Feb 06 '20 at 10:17