I am trying to get a deep understanding of PCA. From my understanding, a principal component is defined as $$\mathbf{z}_k = \phi_{1,k} \mathbf{x}_1 + \ldots + \phi_{p,k} \mathbf{x}_p = \mathbf{X} \boldsymbol{\phi}_k, \tag{1}$$ where $\boldsymbol{\phi}_k = (\phi_{1,k}, \ldots, \phi_{p,k})$ is a vector of scalars and $\mathbf{x}_j$ is the $j^{\text{th}}$ predictor. In other words, a principal component is a linear combination of the original predictors. The loading vectors $\boldsymbol{\phi}_k$ are chosen to maximize the variance of the principal components, i.e. to maximize $\mathrm{Var}(\mathbf{X} \boldsymbol{\phi}_k)$, and as a result the loading vectors are mutually orthogonal, i.e. $\langle \boldsymbol{\phi}_k , \boldsymbol{\phi}_{\ell} \rangle = 0$ whenever $k \neq \ell$. Also, during the optimization we constrain each loading vector to have unit length, so $\| \boldsymbol{\phi}_k \|_2 = 1$ for all $k$.
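As a quick sanity check of these two properties (a minimal sketch of my own, using `prcomp` on simulated data rather than anything from the book), the loading vectors R returns are unit length and mutually orthogonal, and each score vector is exactly the linear combination in $(1)$:

set.seed(1)
X = scale(matrix(rnorm(600), ncol = 3), center = TRUE, scale = FALSE)  # 200 x 3 centered data matrix
pca = prcomp(X, center = FALSE)
Phi = pca$rotation                 # columns are the loading vectors phi_k
round(crossprod(Phi), 10)          # identity matrix: unit-length, mutually orthogonal loadings
max(abs(pca$x - X %*% Phi))        # ~ 0: each score vector is X %*% phi_k, as in (1)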

To write this more compactly: if the columns of a matrix $\mathbf{Z}$ are the principal components and the columns of $\mathbf{\Phi}$ are the loading vectors, we have $$\mathbf{Z} = \mathbf{X} \mathbf{\Phi}. \tag{2}$$ As a result of the two conditions above, the matrix $\mathbf{\Phi}$ is orthogonal, meaning $\mathbf{\Phi}^{-1} = \mathbf{\Phi}^T$. So right-multiplying both sides of $(2)$ by $\mathbf{\Phi}^T$ gives us $$ \mathbf{X} = \mathbf{Z} \mathbf{\Phi}^T. \tag{3}$$ It is worth noting that in practice, $(3)$ is computed using the singular value decomposition $\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T$, where $\mathbf{Z} = \mathbf{U} \mathbf{D}$ and $\mathbf{\Phi} = \mathbf{V}$.
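To see $(2)$, $(3)$, and the SVD connection numerically (again just a sketch of my own with simulated data, not from the book):

set.seed(1)
X = scale(matrix(rnorm(600), ncol = 3), center = TRUE, scale = FALSE)  # centered 200 x 3 data matrix
sv = svd(X)                          # X = U D V^T
Z = sv$u %*% diag(sv$d)              # scores: Z = U D
Phi = sv$v                           # loadings: Phi = V
max(abs(Z - X %*% Phi))              # ~ 0: Z = X Phi, equation (2)
max(abs(X - Z %*% t(Phi)))           # ~ 0: X = Z Phi^T, equation (3)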

Rewriting the two matrices on the right-hand side of $(3)$ as $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_p)$ and $\mathbf{\Phi}^T = (\boldsymbol{\phi}_1^T, \ldots, \boldsymbol{\phi}_p^T)^T$, we get $$ \begin{align} \mathbf{X} &= \begin{pmatrix} \mathbf{z}_1 & \cdots & \mathbf{z}_p \end{pmatrix} \begin{pmatrix} \boldsymbol{\phi}_1^T \\ \vdots \\ \boldsymbol{\phi}_p^T \end{pmatrix} \\ &= \mathbf{z}_1 \boldsymbol{\phi}_1^T + \ldots + \mathbf{z}_p \boldsymbol{\phi}_p^T \\ &= (\mathbf{X} \boldsymbol{\phi}_1) \boldsymbol{\phi}_1^T + \ldots + (\mathbf{X} \boldsymbol{\phi}_p) \boldsymbol{\phi}_p^T\tag{4} \\ &= \mathbf{X} \Big( \boldsymbol{\phi}_1 \boldsymbol{\phi}_1^T + \ldots + \boldsymbol{\phi}_p \boldsymbol{\phi}_p^T \Big). \end{align}$$ From this, it has to be true that $\Big( \boldsymbol{\phi}_1 \boldsymbol{\phi}_1^T + \ldots + \boldsymbol{\phi}_p \boldsymbol{\phi}_p^T \Big) = \mathbf{I}$. Here is some empirical evidence that this is true, using the simple case of two predictors:

set.seed(100)
x1 = rnorm(2000); x2 = x1 + 0.5*rnorm(2000)   # two correlated predictors
mat = matrix(c(x1, x2), ncol = 2)             # 2000 x 2 data matrix X
matsvd = svd(mat)                             # X = U D V^T
D = diag(2); diag(D) = matsvd$d               # 2 x 2 diagonal matrix of singular values
score = matsvd$u %*% D                        # scores Z = U D
load = matsvd$v                               # loadings Phi = V

load[,1] %*% t(load[,1]) + load[,2] %*% t(load[,2])   # sum of outer products phi_k phi_k^T
     [,1] [,2]
[1,]    1    0
[2,]    0    1
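The same objects also confirm the rank-one expansion in $(4)$, i.e. that the outer products $\mathbf{z}_k \boldsymbol{\phi}_k^T$ sum back to $\mathbf{X}$ (an extra check I added on top of the code above):

recon = score[,1] %*% t(load[,1]) + score[,2] %*% t(load[,2])
max(abs(mat - recon))   # ~ 0: X = z_1 phi_1^T + z_2 phi_2^T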

My problem is that I cannot come up with an intuitive reason why $\boldsymbol{\phi}_1 \boldsymbol{\phi}_1^T + \ldots + \boldsymbol{\phi}_p \boldsymbol{\phi}_p^T = \mathbf{I}$ must hold, and I was wondering if anyone could provide one. Is there any significant meaning behind $\boldsymbol{\phi}_k \boldsymbol{\phi}_k^T$? (As answered by @gunes below, since $\mathbf{\Phi}$ is orthogonal and square, we have $\mathbf{\Phi} \mathbf{\Phi}^T = \mathbf{I}$.)

EDIT

I would also like to know if my definitions are correct. I stated that $\boldsymbol{\phi}_k$ is the loading vector for the $k^{\text{th}}$ principal component, so the matrix $\mathbf{\Phi}$ would be the loading matrix. I am getting this definition from Section 10.2.1 of An Introduction to Statistical Learning. However, I have also seen (for example, here) the loading vector defined as $\boldsymbol{\phi}_k = d_k \boldsymbol{v}_k$, i.e. the $k^{\text{th}}$ loading vector is the $k^{\text{th}}$ right singular vector scaled up by the $k^{\text{th}}$ singular value. So which definition is correct?
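For reference, the two conventions are easy to compare using the matsvd object from the code above (my own check, not taken from either source): ISL's usage corresponds to the unit-norm columns of $\mathbf{V}$, the other to those columns scaled by the singular values.

directions = matsvd$v                    # unit-norm right singular vectors (ISL's phi_k)
scaled = matsvd$v %*% diag(matsvd$d)     # the same vectors scaled by d_k (the other convention)
colSums(directions^2)                    # 1 1  -> each column has unit length
colSums(scaled^2)                        # equals matsvd$d^2 -> lengths carry the singular values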

  • To your last section about definitions. Some people, texts, and programs call "loadings" the eigenvector entries, and some call "loadings" those entries scaled up by the corresponding eigen- (or singular) values. The second way is better for a number of reasons, including linguistic ones, and I would strongly [recommend](https://stats.stackexchange.com/a/35653/3277) following it. (cont.) – ttnphns Jul 18 '19 at 08:34
  • (cont.) And even in the current Wikipedia article on PCA, if you read it through, you'll find that one paragraph implies the word "loadings" applies both to the unit-scaled direction vectors and to those vectors scaled by the eigen- (or singular) values, while a later section defines "loadings" as the label for the second only. – ttnphns Jul 18 '19 at 08:35
  • Ah okay, so it would be best to have $\mathbf{\Phi}$ be the *principal directions*, and the loadings would be given by $\mathbf{\Phi} \mathbf{D}$. The notational inconsistency is annoying, as it makes learning about something new more difficult than it needs to be! – akenny430 Jul 19 '19 at 01:09

1 Answer

We know (and you also stated) that $\mathbf{\Phi}$ is an orthogonal matrix, i.e. $\mathbf{\Phi}\mathbf{\Phi}^T=\mathbf{I}$. Expanding the left-hand side gives $$\mathbf{\Phi}\mathbf{\Phi}^T=\begin{pmatrix}\boldsymbol{\phi}_1 & \cdots & \boldsymbol{\phi}_p\end{pmatrix}\begin{pmatrix}\boldsymbol{\phi}_1^T \\ \vdots \\ \boldsymbol{\phi}_p^T\end{pmatrix}=\sum_{i=1}^p\boldsymbol{\phi}_i\boldsymbol{\phi}_i^T=\mathbf{I}.$$
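One step that may be worth spelling out (an editorial addition, not part of the original answer): the unit-length and orthogonality constraints only say directly that the columns of $\mathbf{\Phi}$ are orthonormal, i.e. $\mathbf{\Phi}^T \mathbf{\Phi} = \mathbf{I}$. Because $\mathbf{\Phi}$ is $p \times p$ (square), a left inverse is also a right inverse, so $$\mathbf{\Phi}^T \mathbf{\Phi} = \mathbf{I} \;\Longrightarrow\; \mathbf{\Phi}^{-1} = \mathbf{\Phi}^T \;\Longrightarrow\; \mathbf{\Phi}\mathbf{\Phi}^T = \mathbf{\Phi}\mathbf{\Phi}^{-1} = \mathbf{I},$$ which is exactly the sum of outer products above.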

gunes