4

In probabilistic PCA (PPCA), dimensionality reduction is modelled as the probabilistic model

$t = Wx + \mu + \epsilon$

where

$W$ is a non-square matrix of size $(d \times q)$, $q < d$

$x$ is a vector from the $q$-dimensional "latent" space

$t$ is a vector from the $d$-dimensional "observable" space

$\epsilon$ is a normally distributed random variable, covering the "insufficiency" of the transform from the low-dimensional space to the high-dimensional one.

So, the forward transform (from the latent space to the observable space) is performed by the matrix

$W$

and the reverse transform is performed by the matrix

$(\sigma^2 I + W^T W)^{-1} W^T$

(the formula for the "posterior" mean, in the middle of a paragraph in chapter 3.3)

I know the matrices are not square, but why are the two transforms so different?
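A small NumPy sketch (with an arbitrary random $W$ and made-up values for $d$, $q$, $\sigma^2$) illustrates the asymmetry the question is asking about: the reverse transform is not simply the inverse of the forward one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma2 = 5, 2, 0.1  # made-up dimensions and noise variance

# Hypothetical loading matrix W (d x q) and offset mu.
W = rng.standard_normal((d, q))
mu = rng.standard_normal(d)

# Forward transform: latent x -> observable t (noise-free part).
x = rng.standard_normal(q)
t = W @ x + mu

# Reverse transform via the posterior mean (chapter 3.3):
# <x | t> = (sigma^2 I + W^T W)^{-1} W^T (t - mu)
M = sigma2 * np.eye(q) + W.T @ W
x_hat = np.linalg.solve(M, W.T @ (t - mu))

# x_hat is a shrunken estimate of x, not x itself: the sigma^2 term
# pulls the posterior mean towards zero, so the two transforms are
# not inverses of one another.
print(x, x_hat)
```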

UPDATE

Now I saw formula (16), which claims the reverse transform is done with the matrix

$W(W^T W)^{-1}(\sigma^2 I + W^T W)$

Anyway, the question is: why is it so complex, and is there any notion of a matrix inverse for non-square matrices in this case?


1 Answer

You are looking at a hat matrix. The PPCA reconstruction is defined similarly to a regression model; what is described as $W(W^TW)^{-1}M$ (Eq. 16) is little more than the hat matrix of a Tikhonov-regularised regression task in latent space.

In the case of probabilistic PCA, as well as ridge regression, we assume a covariance $C$ such that $C = WW^T + \sigma^2 I$, where we capture "unaccounted variance" in $\sigma^2$. In the case of probabilistic PCA we call it the "noise variance", while in the case of ridge regression it is the "regularisation parameter / Tikhonov factor". Remember that $\langle x_n \rangle = M^{-1}W^T(t_n - \mu)$ (Eq. 55); simple substitution of Eq. 55 into Eq. 16 yields the well-known formulation of a hat matrix (which is actually presented in Eq. 68). On that matter, CV already has a very good answer on how ridge regression relates to PCA here.
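The substitution can be checked numerically; in the sketch below (a random $W$, $\mu = 0$ and an arbitrary $\sigma^2$, all made up for illustration) the $M$ from Eq. 16 cancels against the $M^{-1}$ from Eq. 55, leaving exactly the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, sigma2 = 6, 3, 0.5  # made-up dimensions and noise variance
W = rng.standard_normal((d, q))
t = rng.standard_normal(d)          # take mu = 0 for brevity
M = sigma2 * np.eye(q) + W.T @ W

# Eq. 55: posterior mean  <x> = M^{-1} W^T t
x_mean = np.linalg.solve(M, W.T @ t)

# Eq. 16: optimal reconstruction  t_hat = W (W^T W)^{-1} M <x>
t_hat = W @ np.linalg.solve(W.T @ W, M @ x_mean)

# M and M^{-1} cancel, leaving the hat matrix
# H = W (W^T W)^{-1} W^T applied to t.
H = W @ np.linalg.solve(W.T @ W, W.T)
assert np.allclose(t_hat, H @ t)
```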

To emphasize how ubiquitous this formulation is: $B$-spline basis functions do exactly the same task within the context of non-parametric regression. In that case $\hat{\beta} = (B^TB + \lambda \Omega)^{-1}B^Ty$. Here, instead of having a non-parametric basis as with the eigen-components provided by standard PCA, we use a set of $B$-spline basis functions $B$. To quote L. Wasserman directly: "The effect of the term $\lambda \Omega$ is to shrink the regression coefficients towards a subspace, which results in a smoother fit."$^\dagger$ Again, same idea. $B$-spline regression calls this fit "smoother", ridge regression calls it "regularised", and PPCA calls it "optimally reconstructed".
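A minimal sketch of that shrinkage effect, assuming a random design matrix in place of a real spline basis and $\Omega = I$ (which reduces the penalty to plain ridge regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 4, 5.0              # made-up sizes and penalty
B = rng.standard_normal((n, p))     # stand-in for a basis matrix
y = rng.standard_normal(n)
Omega = np.eye(p)                   # Omega = I gives plain ridge

# Unpenalised least squares vs. the penalised estimate
# beta_hat = (B^T B + lambda * Omega)^{-1} B^T y.
beta_ols = np.linalg.solve(B.T @ B, B.T @ y)
beta_pen = np.linalg.solve(B.T @ B + lam * Omega, B.T @ y)

# The penalty shrinks the coefficients towards zero.
assert np.linalg.norm(beta_pen) < np.linalg.norm(beta_ols)
```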

Going back to probabilistic PCA: in the case of $\sigma^2 = 0$ the above-mentioned formula (Eq. 16) reduces directly to conventional PCA. In that case, the matrix $W$ would form an orthonormal basis, and therefore the "reverse transformation" would simply require $W^T$.
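A quick check of that limit, assuming an orthonormal $W$ obtained from a QR decomposition (as conventional PCA would deliver): with $\sigma^2 = 0$ and $W^TW = I$, the reverse-transform matrix collapses to $W^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 5, 2  # made-up dimensions

# W with orthonormal columns, as delivered by conventional PCA.
W, _ = np.linalg.qr(rng.standard_normal((d, q)))

t = rng.standard_normal(d)
# With sigma^2 = 0 and W^T W = I, the reverse transform
# (sigma^2 I + W^T W)^{-1} W^T is just W^T.
x_rev = np.linalg.solve(0.0 * np.eye(q) + W.T @ W, W.T @ t)
assert np.allclose(x_rev, W.T @ t)
```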

As a side-note: yes, there is a notion of matrix inverse for non-square matrices, the Moore-Penrose pseudo-inverse. Using it for PCA is like using a battle-tank for pizza delivery: doable, but most probably daft. I have only seen it used for pedagogical purposes, or being actually useful when handling small, nearly rank-deficient covariance matrices (i.e. a very particular application).
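For completeness, a sketch (with an arbitrary random $W$) showing that for a full-column-rank $W$ the pseudo-inverse coincides with the $\sigma^2 = 0$ reverse transform $(W^TW)^{-1}W^T$:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 2))     # full column rank (almost surely)

# For full-column-rank W the Moore-Penrose pseudo-inverse equals
# (W^T W)^{-1} W^T, i.e. the sigma^2 = 0 reverse transform.
W_pinv = np.linalg.pinv(W)
assert np.allclose(W_pinv, np.linalg.solve(W.T @ W, W.T))
```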

$\dagger$. Larry Wasserman, "All of Nonparametric Statistics", Chapt. 5.5 "Penalized Regression, Regularization and Splines"

usεr11852
  • I wish you would explain more what exactly the connection between PPCA and RR is. I know both methods quite well, but to be honest it never occurred to me that they are related. – amoeba Oct 22 '17 at 00:08
  • @amoeba: Thank you for your comment. You have given a great answer [here](https://stats.stackexchange.com/q/133257) to this question on relating PCA and RR. I mean... I actually thought of linking it for a moment (and then forgot to do it - fixed now). To quote you: "*This means that ridge regression can be seen as a "smooth version" of PCR.*" which is in line with my own answer. I just make the connection even more obvious as PPCA explicitly defines a "noise model". – usεr11852 Oct 22 '17 at 00:23
  • Hmm. Interesting. That answer is about how PC regression (PCR) is related to RR, whereas this Q is about PCA/PPCA itself, without any regression. Also, the interesting part is the term with $\sigma$... Anyway, I should think more about this. Might come back here later. – amoeba Oct 23 '17 at 21:13