
I noticed a (seemingly) weird behavior while using sklearn's PCA on 2D standardized datasets: I kept getting the same principal axes: $\pm\left(\begin{gathered}\sqrt{0.5}\\ \sqrt{0.5} \end{gathered} \right)$ and $\pm\left(\begin{gathered}\sqrt{0.5}\\ -\sqrt{0.5} \end{gathered} \right)$ (i.e. the lines $y = x$ and $y = -x$), even when I significantly changed the dataset.

Just to be sure, I wrote a short script that demonstrates this behavior.
(The script creates nonsense datasets from different distributions, then standardizes each dataset and performs PCA on it.)

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

dists_funcs = [np.random.chisquare, np.random.exponential, np.random.power,
               np.random.standard_gamma, np.random.weibull, np.random.rayleigh,
               np.random.pareto, np.random.poisson, np.random.standard_t]
sqrt_of_half = 0.5 ** 0.5
n = 1000

for i, dist_func1 in enumerate(dists_funcs[:-1]):
    dist_func2 = dists_funcs[i + 1]
    # Build a nonsense 2D dataset: the first feature is drawn from one
    # distribution, and the second is the first scaled by a single draw
    # from another distribution.
    X = np.array([(a, a * dist_func2(i + 2)) for a in dist_func1(i + 3, n)])
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
    pipe.fit(X)
    pca = pipe.named_steps['pca']
    # Report any component entry whose absolute value differs from sqrt(0.5).
    for principal_axis in pca.components_:
        for z in principal_axis:
            if abs(abs(z) - sqrt_of_half) > 1e-10:
                print(f'got {principal_axis} in {i}')
print('done')


Is this behavior guaranteed? Is there an intuitive explanation for it?

Oren Milman

This isn't weird: see the figure just above the "Application" heading I posted at https://stats.stackexchange.com/questions/71260/what-is-the-intuition-behind-conditional-gaussian-distributions/71303#. When you standardize the dataset, there is only one more parameter determined by its first and second moments--the correlation coefficient--which leaves no freedom for anything else (such as the direction of a principal axis) to vary. The sole exception is the case of zero correlation, where the principal axes are undefined. – whuber Aug 19 '18 at 21:11

1 Answer


Here is an explanation of the described behavior and of when it is guaranteed, though I don't find it intuitive.

First, the $2$ principal axes are orthonormal eigenvectors of the covariance matrix $C$. This is explained and proved in an answer by amoeba.
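
To see this fact numerically, here is a minimal sketch (the dataset and seed are arbitrary choices of mine, not part of the question's script) comparing pca.components_ with the eigenvectors that np.linalg.eigh returns for the sample covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.5],
                                           [0.5, 1.0]])  # arbitrary 2D data

pca = PCA(n_components=2).fit(X)
C = np.cov(X.T)  # sample covariance matrix (PCA centers the data itself)

# eigh returns eigenvalues in ascending order, while PCA sorts components
# by decreasing variance; each eigenvector is determined only up to sign.
eigvals, eigvecs = np.linalg.eigh(C)
print(np.allclose(np.abs(eigvecs[:, ::-1].T), np.abs(pca.components_)))  # True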

Thus, the behavior is guaranteed iff every eigenvector of the covariance matrix is given by $t\left(\begin{gathered}1\\ 1 \end{gathered} \right)$ or $t\left(\begin{gathered}1\\ -1 \end{gathered} \right)$ for some nonzero $t\in\mathbb R$.

Since the dataset given to PCA in this case is standardized, the variance of each feature (aka variable) is $1$. As $C$ is symmetric with unit diagonal, there exists an $a$ such that $C=\left(\begin{matrix}1 & a\\ a & 1 \end{matrix}\right)$, where $a$ is the correlation coefficient of the two features.
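
A quick sketch of this (again with an arbitrary dataset of my own): after StandardScaler, the population covariance matrix of the scaled data has ones on the diagonal, and the off-diagonal entry is exactly the correlation coefficient:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x = rng.exponential(size=1000)
X = np.column_stack([x, x + rng.normal(size=1000)])  # arbitrary correlated data

Xs = StandardScaler().fit_transform(X)
# StandardScaler uses the population std (ddof=0), hence bias=True here.
print(np.cov(Xs.T, bias=True))   # [[1, a], [a, 1]]
print(np.corrcoef(X.T)[0, 1])    # the same a: the correlation coefficient of X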

If $a=0$ (i.e. the features are uncorrelated), then $C=I$, so every nonzero vector is an eigenvector (in particular $\left(\begin{gathered}1\\ 0 \end{gathered} \right)$ and $\left(\begin{gathered}0\\ 1 \end{gathered} \right)$), and the behavior is not guaranteed.
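
A tiny illustration of why nothing singles out the $y=\pm x$ directions in this case (the test vectors are arbitrary):

import numpy as np

C = np.eye(2)  # the a = 0 case
# Every nonzero vector is an eigenvector of the identity matrix:
for v in [np.array([1.0, 0.0]), np.array([0.6, 0.8]), np.array([0.3, -2.0])]:
    print(np.allclose(C @ v, 1.0 * v))  # True each time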

Otherwise, $a\not=0$. Let's find the eigenvalues of $C$: $$\begin{gathered}\left|C-\lambda I\right|=0\\\left|\left(\begin{matrix}1-\lambda & a\\ a & 1-\lambda \end{matrix}\right)\right|=0\\ \left(1-\lambda\right)^{2}-a^{2}=0\\ \left(1-\lambda+a\right)\left(1-\lambda-a\right)=0\\ \lambda=1+a\,,\,1-a \end{gathered}$$
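
A one-line numerical check of this (the value of $a$ is an arbitrary choice):

import numpy as np

a = 0.3  # any nonzero value would do
C = np.array([[1.0, a],
              [a, 1.0]])
print(np.linalg.eigvalsh(C))  # [0.7, 1.3], i.e. 1 - a and 1 + a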

So by definition, a nonzero vector $\left(\begin{gathered}x\\ y \end{gathered} \right)$ is an eigenvector of $C$ iff $\left[C\left(\begin{gathered}x\\ y \end{gathered} \right)=(1+a)\left(\begin{gathered}x\\ y \end{gathered} \right)\text{ or }C\left(\begin{gathered}x\\ y \end{gathered} \right)=(1-a)\left(\begin{gathered}x\\ y \end{gathered} \right)\right]$, which (by doing the algebra, using $a\not=0$) holds iff $x=y$ or $x=-y$.

In other words, $\left(\begin{gathered}x\\ y \end{gathered} \right)$ is an eigenvector of $C$ iff it is given by $t\left(\begin{gathered}1\\ 1 \end{gathered} \right)$ or $t\left(\begin{gathered}1\\ -1 \end{gathered} \right)$ for some nonzero $t\in\mathbb R$.
Therefore, the behavior is guaranteed.
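
A short check of the whole claim (the values of $a$ are arbitrary): for any nonzero $a$, np.linalg.eigh returns exactly these two directions:

import numpy as np

for a in (0.3, -0.75, 0.999):  # a few arbitrary nonzero correlations
    C = np.array([[1.0, a],
                  [a, 1.0]])
    _, eigvecs = np.linalg.eigh(C)
    print(np.abs(eigvecs.T))  # every row is (sqrt(0.5), sqrt(0.5)) up to sign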

By the same reasoning, the behavior is guaranteed whenever $C=\left(\begin{matrix}b & a\\ a & b \end{matrix}\right)$ with $a\not=0$: the eigenvalues become $b\pm a$, with the same eigenvectors.
That is, we would also get this behavior if we only centered the dataset, as long as the two features are correlated and have the same variance (see the sketch below).
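
Here is a sketch of that case (the dataset construction is my own, arbitrary choice): two features built to have equal variances and correlation $0.5$, only centered, with no scaling. With a finite sample the variances are only approximately equal, so the axes come out approximately, rather than exactly, at $y=\pm x$:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
z = rng.normal(size=100000)
# Two features with equal (population) variances and nonzero correlation:
X = np.column_stack([z + rng.normal(size=100000),
                     z + rng.normal(size=100000)])
X = X - X.mean(axis=0)  # center only; no scaling (PCA also centers internally)

pca = PCA(n_components=2).fit(X)
print(pca.components_)  # rows approximately +-(sqrt(0.5), +-sqrt(0.5))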

Oren Milman