
I perform Principal Component Analysis on two standardized variables. This is done by applying an SVD to the correlation matrix of the variates concerned. However, the SVD gives me the same eigenvector (weights) irrespective of which two variables I use. It is always $(.70710678, .70710678)$. I find this strange. The eigenvalues, of course, differ.
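A minimal NumPy sketch of what I am doing (the data here are hypothetical; any pair of correlated variables produces the same eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two arbitrary correlated variables (hypothetical toy data)
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(size=1000)

R = np.corrcoef(x, y)            # 2x2 correlation matrix
U, s, Vt = np.linalg.svd(R)      # SVD of a symmetric PSD matrix = eigendecomposition
print(U[:, 0])                   # first eigenvector: always +-(0.7071, 0.7071)
```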

My question is: How to interpret this?


PS. I wanted to conduct a total least squares (TLS) regression on two variables. My statistical programme does not provide TLS, but luckily TLS is equivalent to Principal Component Analysis, as far as I know. Hence my question. The question is not about TLS directly, but about why I get the same eigenvectors irrespective of which variables I use (as long as there are exactly two of them).

MaHo

2 Answers


Algebraically, the correlation matrix for two variables looks like this: $$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$ Following the definition of an eigenvector, it is easy to verify that $(1, 1)$ and $(-1, 1)$ are the eigenvectors irrespective of $\rho$, with eigenvalues $1+\rho$ and $1-\rho$. For example:

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix}=(\rho+1)\begin{pmatrix}1\\1\end{pmatrix}.$$

Normalizing these two eigenvectors to unit length yields $(\sqrt{2}/2, \sqrt{2}/2)$ and $(-\sqrt{2}/2, \sqrt{2}/2)$, as you observed.
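A quick numerical check of this claim, sweeping over several values of $\rho$ (a sketch using NumPy's `eigh`):

```python
import numpy as np

for rho in (0.1, 0.5, 0.9, -0.3):
    R = np.array([[1.0, rho], [rho, 1.0]])
    evals, evecs = np.linalg.eigh(R)   # eigh returns eigenvalues in ascending order
    # the eigenvalues are 1 - rho and 1 + rho (which is larger depends on rho's sign)
    assert np.allclose(sorted(evals), sorted([1 - rho, 1 + rho]))
    # the eigenvectors are +-(1, 1)/sqrt(2) and +-(-1, 1)/sqrt(2),
    # so every entry has absolute value sqrt(2)/2 ~ 0.7071
    assert np.allclose(np.abs(evecs), np.sqrt(2) / 2)
print("verified for all tested rho")
```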

Geometrically, if the variables are standardized, then the scatter plot will always be stretched along the main diagonal (which will be the 1st PC) if $\rho>0$, whatever the value of $\rho$ is:

[Figure: scatter plots of two standardized variables with various correlation coefficients]

Regarding TLS, you might want to check my answer in this thread: How to perform orthogonal regression (total least squares) via PCA? As should be pretty obvious from the figure above, if both your $x$ and $y$ are standardized, then the TLS line is always a diagonal. So it hardly makes sense to perform TLS at all! However, if the variables are not standardized, then you should be doing PCA on their covariance matrix (not on their correlation matrix), and the regression line can have any slope.
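A sketch of that last point on hypothetical toy data: PCA on the covariance matrix recovers a non-trivial TLS slope, whereas the correlation matrix always yields the diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
# Unstandardized toy data: y lives on a different scale than x
x = rng.normal(scale=2.0, size=2000)
y = 1.5 * x + rng.normal(scale=0.2, size=2000)

# TLS line direction = first principal component of the *covariance* matrix
C = np.cov(x, y)
evals, evecs = np.linalg.eigh(C)
v = evecs[:, np.argmax(evals)]       # eigenvector of the largest eigenvalue
tls_slope = v[1] / v[0]
print(tls_slope)                      # ~1.5 here; not forced to 1

# On the *correlation* matrix, the first PC is always the diagonal:
R = np.corrcoef(x, y)
w = np.linalg.eigh(R)[1][:, -1]
print(abs(w[1] / w[0]))               # always 1.0
```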


For a discussion of the case of three dimensions, see here: https://stats.stackexchange.com/a/19317.

amoeba
  • If you do PCA on a covariance matrix for two variables (variables not standardized), will it yield an "unstandardized" 45 degree line? By which I mean, a line with slope sigma_y/sigma_x, that goes through the point (mean_x, mean_y)? – Richard DiSalvo Jul 26 '21 at 17:16

As your first eigenvector is $(\sqrt{2}/2, \sqrt{2}/2)$, the other eigenvector is (since we are in 2D) uniquely, up to a factor of $\pm 1$, the vector $(\sqrt{2}/2, -\sqrt{2}/2)$. So you get your diagonalizing orthogonal matrix as $$\frac{\sqrt{2}}{2}\left[ \begin{array}{cc} 1 & 1 \\ 1 & -1 \end{array} \right].$$

Now we can reconstruct the covariance* matrix, which has the shape $$\left[ \begin{array}{cc} a+b & a-b \\ a-b & a+b \end{array} \right], $$ whose eigenvalues are $2a$ and $2b$. I would suggest looking closely at your model or the origin of the data. Then you might find a reason why your data may be distributed as $X_1=X_a + X_b$ and $X_2 = X_a - X_b$, where $\operatorname{Var}(X_a)=a$, $\operatorname{Var}(X_b)=b$, and $X_a$ and $X_b$ are independent.

If your data follow a continuous multivariate distribution, the correlation matrix will almost surely arise from such a sum/difference relation. If the data follow a discrete distribution, it is still very likely that the model $X_1=X_a + X_b$ and $X_2 = X_a - X_b$ describes your data properly. In that case, you don't need a PCA.
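A minimal simulation of this construction, with hypothetical values $a=0.8$ and $b=0.2$ chosen so that $a+b=1$ (the correlation-matrix case):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.8, 0.2                                   # a + b = 1, so C is a correlation matrix
Xa = rng.normal(scale=np.sqrt(a), size=100_000)   # Var(Xa) = a
Xb = rng.normal(scale=np.sqrt(b), size=100_000)   # Var(Xb) = b, independent of Xa
X1 = Xa + Xb
X2 = Xa - Xb

C = np.cov(X1, X2)
print(C)   # approximately [[a+b, a-b], [a-b, a+b]] = [[1.0, 0.6], [0.6, 1.0]]
```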

But it is generally better to infer such relations from solid insight into the nature of the data rather than from estimation procedures like PCA.

*Say correlation matrix, if $a+b=1$.

Horst Grünbusch
    I am confused by this answer. Why is it "obvious"? Also, the claim in the question is that the eigenvectors of any 2x2 correlation matrix are the same. Is it true? If so, why? The model or origin of the data are of no relevance here. – amoeba Mar 05 '15 at 11:27
  • The model and origin of the data are of relevance, because that's what the question is about. I'll clarify the word "obvious", because you're right, this word is always a trap for proofs. – Horst Grünbusch Mar 05 '15 at 11:31