
I have a question about probabilistic PCA (PPCA) and regular PCA, particularly regarding transforming to and from the latent space. The main question (detailed in the following) is: when are the eigenvalues of the covariance matrix used in the transformations?

In all cases, I assume $X \in \mathbb{R}^{m\times n}$, where each row is a datum and each column is a feature, but a single datum is written as a column vector $x_j\in\mathbb{R}^{n\times 1}\equiv \mathbb{R}^{n}$ when alone (unfortunate habits of some fields). Let's also assume $X$ is centered, for simplicity.


PCA

There are two methods to write the PC decomposition. One is to use the SVD of the data matrix: $$ X = U\Sigma V^T\;\;\;\implies\;\;\; XV = U\Sigma =: Z $$ thus we have that $V^T$ is the transformation matrix; i.e., $$ z_\ell = V^T x_\ell $$ is the mapping from the data space to the latent space. Geometrically, the columns of $V$ are orthogonal axes of the principal space, and we are simply projecting onto them.
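
For concreteness, here is a minimal numpy sketch of the SVD route (the random data, seed, and shapes are just placeholders for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # placeholder data: m = 100 rows, n = 5 features
X -= X.mean(axis=0)               # center, as assumed above

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                      # latent coordinates: Z = X V
assert np.allclose(Z, U * S)      # equals U Sigma, with Sigma = diag(S)
```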

The other method is eigendecomposition of the covariance matrix: $$ \hat{C} = V\Lambda V^T\;\;\; \text{where} \;\;\;\hat{C}=\frac{1}{m-1} X^T X$$ so that the eigenvectors (columns of $V$) of $\hat{C}$ form the basis of a new space. The eigenvalues $\lambda_i$ in $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$ are often called the "explained variance" of axis $i$.
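
A quick numerical check (again with placeholder data) that the two routes agree: the eigenvectors of $\hat{C}$ match the right singular vectors up to sign, and $\lambda_i = s_i^2/(m-1)$ where $s_i$ are the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)

C = X.T @ X / (m - 1)
lam, V = np.linalg.eigh(C)                    # ascending order
lam, V = lam[::-1], V[:, ::-1]                # sort by decreasing eigenvalue

U, S, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(lam, S**2 / (m - 1))       # lambda_i = s_i^2 / (m - 1)
assert np.allclose(np.abs(V.T @ Vt.T), np.eye(n), atol=1e-6)   # same axes, up to sign
```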

For dimensionality reduction, we take only the first $k$ columns of $V$ as the basis of the latent space (so $z_j\in\mathbb{R}^k$), truncating $V$ into a new matrix $V_k \in \mathbb{R}^{n \times k}$. Then the transformation equations are: \begin{align} z &= V_k^T x \\ \hat{x} &= V_k z \end{align} where $x$ and $z$ are any vectors in the data and latent space respectively. Notice we do not use $\Lambda$ or $\Sigma$ - hopefully this is correct.
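
The rank-$k$ maps above, sketched here with $k=2$ (an arbitrary choice), use only $V_k$; no eigenvalue scaling enters:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 5, 2                  # k = 2 is an arbitrary illustrative choice
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                        # n x k matrix of leading principal directions

Z = X @ Vk                           # z = V_k^T x, applied row-wise
X_hat = Z @ Vk.T                     # x_hat = V_k z
print(np.linalg.norm(X - X_hat))     # error comes only from the dropped directions
```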


PPCA

The PPCA article assumes the relation between $z$ and $x$ can be modelled by a probability model: \begin{align} p(x|z) &= \mathcal{N}( Wz + \mu, \sigma^2 I) \\ \mathbb{E}[z|x] &= M^{-1} W^T (x - \mu) \end{align} with $z\sim \mathcal{N}(0,I)$, $M = W^TW + \sigma^2 I$, and $\mu= 0$ since $X$ is centered.
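
As a sketch of how I read these equations (the $W$, $\sigma^2$, and $x$ below are arbitrary placeholders, not fitted values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
W = rng.normal(size=(n, k))          # placeholder loading matrix
sigma2 = 0.1                         # placeholder noise variance
mu = np.zeros(n)                     # X is centered, so mu = 0
x = rng.normal(size=n)               # some datum

M = W.T @ W + sigma2 * np.eye(k)
Ez_given_x = np.linalg.solve(M, W.T @ (x - mu))   # E[z|x] = M^{-1} W^T (x - mu)
```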

The authors show that the maximum likelihood estimate for $W$ (for a given $k$) is given by: $$ W = V_k (\Lambda_k - \sigma^2 I)^{1/2} R, $$ where $R$ is an arbitrary orthogonal matrix. So $W$ is not the orthogonal matrix $V$ of eigenvectors - it is instead a rotation and axis-wise scaling of the original principal directions. I'll assume $R = I$ (as in the original article).
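
In code, my understanding of this ML solution (with $R = I$) would look like the sketch below; I plug in Tipping & Bishop's ML estimate of $\sigma^2$ (the mean of the discarded eigenvalues) just to make it concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 5, 2
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)

C = X.T @ X / (m - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]                 # decreasing eigenvalues
Vk, lam_k = V[:, :k], lam[:k]

sigma2 = lam[k:].mean()                        # ML estimate: mean of discarded eigenvalues
W_ml = Vk @ np.diag(np.sqrt(lam_k - sigma2))   # W = V_k (Lambda_k - sigma^2 I)^{1/2}, R = I
```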

So the transformations now look as follows. Given a point in latent space, the mean of the conditional distribution $p(x|z)$ is \begin{align} \mathbb{E}[x|z] &= Wz = V_k (\Lambda_k - \sigma^2 I)^{1/2} z, \end{align} while the posterior mean of the latent variable given a datum $x$ is $$ \mathbb{E}[z|x] = (W^TW + \sigma^2 I)^{-1} W^T x. $$ Now suppose $\sigma\rightarrow 0$. Then \begin{align} \mathbb{E}[x|z] &= V_k \sqrt{\Lambda_k} z \end{align} transforms from latent space to data space, and \begin{align} \mathbb{E}[z|x] &= (W^TW + \sigma^2 I)^{-1} W^T x \\ &= (W^TW)^{-1} W^T x \\ &= (\sqrt{\Lambda_k}\, \underbrace{V_k^T V_k}_{I}\, \sqrt{\Lambda_k})^{-1} \sqrt{\Lambda_k}\, V_k^T x \\ &= \Lambda_k^{-1} \sqrt{\Lambda_k}\, V_k^T x \\ &= \Lambda_k^{-1/2} V_k^T x \end{align} transforms from data space to latent space.
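
A quick numerical sanity check of that limit (with the same placeholder data, and a tiny $\sigma^2$ standing in for $\sigma \to 0$): the PPCA posterior mean reduces to the PCA scores rescaled by $1/\sqrt{\lambda_i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 5, 2
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)

C = X.T @ X / (m - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]
Vk, lam_k = V[:, :k], lam[:k]

sigma2 = 1e-12                                 # stand-in for sigma -> 0
W = Vk @ np.diag(np.sqrt(lam_k - sigma2))
M = W.T @ W + sigma2 * np.eye(k)

x = X[0]                                       # any datum
Ez = np.linalg.solve(M, W.T @ x)               # E[z|x]
assert np.allclose(Ez, (Vk.T @ x) / np.sqrt(lam_k))   # = Lambda_k^{-1/2} V_k^T x
```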

Notice that (1) this transformation is not the same as that of PCA, despite the disappearance of $\sigma$, and (2) $\Lambda$ now appears in the transformations. As far as I can tell, it merely scales the axes by the standard deviations, but why does it appear here and not in PCA? I suppose for reconstruction (as long as one is consistent) it doesn't matter, but how does it affect either mapping (i.e., $z\rightarrow x$ and $x\rightarrow z$ separately)?

Furthermore, looking at this nice answer, the answerer notes that "one selects a $k$-dimensional vector that represents the point in the "reduced" space of $k$ dimensions, then to map it back to the original data space one needs to multiply it with $S_k V_k^T$", where $S_k = \Sigma_k$ is the diagonal matrix of singular values (equal to $\sqrt{\Lambda_k}$ up to the $\sqrt{m-1}$ factor). The eigenvalues seemingly do not appear elsewhere in the transformations.

Question Summary

When should we use the eigenvalues of the covariance matrix in scaling the components during PCA and PPCA? Is there any difference caused by PPCA vs regular PCA? I am asking specifically in the context of transforming to and from the latent space.

user3658307
  • I'd think this depends on what the Principal Components are used for. If it's just for visualisation, it doesn't matter much, and neither does it matter if PCs are used as input for scale equivariant methods. As input for a method that is not scale equivariant, the unscaled version preserves the information in the original variances, i.e., the first PC has a variance according to the percentage of variation information represented in it. My gut feeling is that more often than not this is what we want, but scaling may also be good in some situations. – Christian Hennig Feb 24 '20 at 23:56
  • Thanks @Lewian! So, is using $\Lambda$ within PPCA unnecessary? I am mostly concerned with using the PCA/PPCA as a "layer" within a larger probabilistic model. Essentially, the scaling with $\Lambda$ is destroying some of the variance information, so I suppose I should not do it. But how does it change the probability density? I noticed that $\Lambda^{1/2}$ essentially "stretches" $z$ during the mapping to $x$; is this perhaps necessary for the probabilistic interpretation, because $z\sim\mathcal{N}(0,I)$? – user3658307 Feb 25 '20 at 06:47
  • As I wrote before, this depends on what the PCs are used for. Unfortunately it's not quite clear to me what precisely you mean by "using the PCA/PPCA as a "layer" within a larger probabilistic model", so I don't currently have an opinion about that. Ultimately one can model scaled and unscaled PCs; obviously the unscaled ones will have a covariance matrix different from $I$. – Christian Hennig Feb 25 '20 at 11:59
  • @Lewian There has been some work about certain regularized autoencoders and Linear VAEs converging to PPCA. Meaning that, for some deep generative models, it is a bit similar to repeated PPCA between non-linearities. I meant "layer" in the "ML" sense. I was just confused about the presence vs absence of $\Lambda$ in various sources. Your comments were helpful; if you want to make them an answer I'll accept. One last qualm though: how to interpret/understand the presence of $\sigma$ in the PPCA transform but not in PCA? Is $\sigma$ only related to the dim reduction, or to noise in the data too? – user3658307 Feb 28 '20 at 21:04
  • Chances are I don't understand PPCA precisely; however, to me this looks like a connection to factor analysis, where every variable can have its independent "noise" on top of what is explained by the factors/principal components. If you allow such a thing, it makes sense to allow this to have any variance $\sigma^2$ (one could even think about allowing a different variance for every variable, but that may make things too complicated). Note that "without the $\sigma^2$" the covariance matrix wouldn't be ${\bf I}$ but ${\bf 0}$. The variables' variance would be determined by the $W$. – Christian Hennig Feb 29 '20 at 00:55
  • @Lewian Yeah, that's what's odd to me. When there's no dim reduction, the $\sigma$ disappears. (In fact, it seems expanding the dimensionality of the data, e.g. by appending constant dims (so adding a dim with $\lambda=0$), would reduce $\sigma$ for fixed $k$... why would adding empty dims reduce noise?) Anyway, yes, it seems like there is an assumption of no additional noise sources (the independent noise you mentioned). Factor analysis might be a good place to look to understand better, thanks. – user3658307 Feb 29 '20 at 02:53
  • Actually there is dimension reduction. The PPCA model assumes that $W^TW$ does not have full rank. In PCA, a full rank will reproduce the original distribution in rotated and standardised form (no additional noise required to fully reproduce the data); if you reduce the rank, you lose information (although you keep maximum variance directions). In PPCA/factor analysis, it is assumed that there's a reduced rank space, and all additional variation is assumed independent across the different variables ($Cov=\sigma^2{\bf I}$). – Christian Hennig Mar 01 '20 at 00:56

0 Answers