
This question results from the discussion following a previous question: What is the connection between partial least squares, reduced rank regression, and principal component regression?

For principal component analysis, a commonly used probabilistic model is $$\mathbf x = \sqrt{\lambda} \mathbf{w} z + \boldsymbol \epsilon \in \mathbb R^p,$$ where $z\sim \mathcal N(0,1)$, $\mathbf{w}\in S^{p-1}$, $\lambda > 0$, and $\boldsymbol\epsilon \sim \mathcal N(0,\mathbf{I}_p)$. Then the population covariance of $\mathbf{x}$ is $\lambda \mathbf{w}\mathbf{w}^T + \mathbf{I}_p$, i.e., $$\mathbf{x}\sim \mathcal N(0,\lambda \mathbf{w}\mathbf{w}^T + \mathbf{I}_p).$$ The goal is to estimate $\mathbf{w}$. This is known as the spiked covariance model, which is frequently used in the PCA literature. The true $\mathbf{w}$ can be estimated by maximizing the sample variance $\operatorname{Var}(\mathbf{Xw})$ over $\mathbf{w}$ on the unit sphere, where $\mathbf{X}$ is the $n \times p$ data matrix.
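To make this concrete, here is a minimal simulation of the spiked covariance model: draw data from the model and estimate $\mathbf w$ as the leading eigenvector of the sample covariance matrix (the parameter values $p$, $n$, $\lambda$ are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 10, 100_000, 4.0

# True direction w on the unit sphere (arbitrary choice for the demo).
w = rng.standard_normal(p)
w /= np.linalg.norm(w)

# Draw x = sqrt(lam) * w * z + eps,  with z ~ N(0,1) and eps ~ N(0, I_p).
z = rng.standard_normal(n)
X = np.sqrt(lam) * np.outer(z, w) + rng.standard_normal((n, p))

# Estimate w as the leading eigenvector of the sample covariance,
# i.e. the direction maximizing Var(Xw) over the unit sphere.
S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)
w_hat = eigvecs[:, -1]           # eigenvector of the largest eigenvalue

# Up to sign, w_hat should be close to the true w.
print(abs(w_hat @ w))            # close to 1, up to sampling noise
```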

As pointed out in the answer to the previous question by @amoeba, reduced rank regression, partial least squares, and canonical correlation analysis have closely related formulations,

\begin{align} \mathrm{PCA:}&\quad \operatorname{Var}(\mathbf{Xw}),\\ \mathrm{RRR:}&\quad \phantom{\operatorname{Var}(\mathbf {Xw})\cdot{}}\operatorname{Corr}^2(\mathbf{Xw},\mathbf {Yv})\cdot\operatorname{Var}(\mathbf{Yv}),\\ \mathrm{PLS:}&\quad \operatorname{Var}(\mathbf{Xw})\cdot\operatorname{Corr}^2(\mathbf{Xw},\mathbf {Yv})\cdot\operatorname{Var}(\mathbf {Yv}) = \operatorname{Cov}^2(\mathbf{Xw},\mathbf {Yv}),\\ \mathrm{CCA:}&\quad \phantom{\operatorname{Var}(\mathbf {Xw})\cdot {}}\operatorname{Corr}^2(\mathbf {Xw},\mathbf {Yv}). \end{align}
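As a quick sanity check of the PLS line above, one can verify numerically that $\operatorname{Var}(\mathbf{Xw})\cdot\operatorname{Corr}^2(\mathbf{Xw},\mathbf{Yv})\cdot\operatorname{Var}(\mathbf{Yv}) = \operatorname{Cov}^2(\mathbf{Xw},\mathbf{Yv})$ holds as an identity for any $\mathbf w, \mathbf v$ (a sketch with arbitrary random data and directions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 1000, 5, 4
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, q)) + rng.standard_normal((n, q))

# Arbitrary projection directions w and v.
w = rng.standard_normal(p)
v = rng.standard_normal(q)
a, b = X @ w, Y @ v              # the projections Xw and Yv

var_a = a.var(ddof=1)            # ddof=1 to match np.cov / np.corrcoef
var_b = b.var(ddof=1)
corr2 = np.corrcoef(a, b)[0, 1] ** 2
cov2  = np.cov(a, b)[0, 1] ** 2

# PLS objective: Var(Xw) * Corr^2(Xw, Yv) * Var(Yv) == Cov^2(Xw, Yv)
print(np.isclose(var_a * corr2 * var_b, cov2))
```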

The question is, what are the probabilistic models behind RRR, PLS, and CCA? In particular, I am thinking about $$(\mathbf{x}^T, \mathbf{y}^T)^T \sim \mathcal N(0, \mathbf{\Sigma}).$$ How does $\mathbf{\Sigma}$ depend on $\mathbf{w}$ and $\mathbf{v}$ in RRR, PLS, and CCA? Moreover, is there a unified probabilistic model (like the spiked covariance model for PCA) for them?

Minkov
  • Hi @Moskowitz. Is my answer going in the direction that you were hoping? I can see that it does not fully answer your question, but I would be happy to get some feedback and would also be interested to know your thoughts about it. I could extend my description of PCCA if you want; the not-really-existing "PPLS" is something that I was thinking about a couple of years ago and find myself thinking about again. So I would be curious to hear your thoughts about it. – amoeba Apr 18 '16 at 12:52
  • Hi @amoeba. Thanks so much for the answer. Sorry for the delayed response. I have been thinking about viewing the PPLS as another multi-view model, where the two x variables have different distributions, but haven't quite succeeded. I will try to add to your answer if I can figure it out ;) – Minkov Apr 18 '16 at 22:14

1 Answer


Probabilistic canonical correlation analysis (probabilistic CCA, PCCA) was introduced in Bach & Jordan, 2005, A Probabilistic Interpretation of Canonical Correlation Analysis, several years after Tipping & Bishop presented their probabilistic principal component analysis (probabilistic PCA, PPCA).

Very briefly, it is based on the following probabilistic model:

\begin{align} \newcommand{\z}{\mathbf z} \newcommand{\x}{\mathbf x} \newcommand{\y}{\mathbf y} \newcommand{\m}{\boldsymbol \mu} \newcommand{\P}{\boldsymbol \Psi} \newcommand{\S}{\boldsymbol \Sigma} \newcommand{\W}{\mathbf W} \newcommand{\I}{\mathbf I} \newcommand{\w}{\mathbf w} \newcommand{\u}{\mathbf u} \newcommand{\0}{\mathbf 0} \z &\sim \mathcal N(\0,\I) \\ \x|\z &\sim \mathcal N(\W_x \z + \boldsymbol \m_x, \P_x)\\ \y|\z &\sim \mathcal N(\W_y \z + \boldsymbol \m_y, \P_y) \end{align}

Here the noise covariances $\P_x$ and $\P_y$ are arbitrary full-rank (symmetric, positive-definite) matrices.

[Figure: PCCA graphical model]

If we consider a one-dimensional latent variable $z$, assume that all means are zero ($\m_x=\m_y=\0$), and combine $\x$ and $\y$ into one vector, then we get:

$$\begin{pmatrix} \x\\ \y\end{pmatrix}\sim\mathcal N (\0,\S),\quad\quad\quad\S=\begin{pmatrix}\w_x\w_x^\top+\P_x & \w_x\w_y^\top \\ \w_y\w_x^\top & \w_y\w_y^\top+\P_y\end{pmatrix}.$$
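This block formula can be checked by simulation: sample from the generative model with a scalar latent $z$ and compare the empirical joint covariance of $(\x, \y)$ against the block matrix $\S$ (all parameter values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 3, 2, 500_000

# Hypothetical model parameters: loading vectors and full-rank noise covariances.
w_x = rng.standard_normal(p)
w_y = rng.standard_normal(q)
A = rng.standard_normal((p, p)); Psi_x = A @ A.T + np.eye(p)
B = rng.standard_normal((q, q)); Psi_y = B @ B.T + np.eye(q)

# Generative model with a one-dimensional latent z and zero means.
z = rng.standard_normal(n)
x = np.outer(z, w_x) + rng.multivariate_normal(np.zeros(p), Psi_x, n)
y = np.outer(z, w_y) + rng.multivariate_normal(np.zeros(q), Psi_y, n)

# Theoretical joint covariance from the block formula.
Sigma = np.block([[np.outer(w_x, w_x) + Psi_x, np.outer(w_x, w_y)],
                  [np.outer(w_y, w_x),         np.outer(w_y, w_y) + Psi_y]])

# Empirical joint covariance should agree up to sampling error.
S_hat = np.cov(np.hstack([x, y]).T)
print(np.abs(S_hat - Sigma).max())   # small sampling error
```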

Bach & Jordan proved that this model is equivalent to standard CCA. Specifically, the maximum likelihood (ML) solution is given by $$\w_i = \S_i\u_i m_i,$$ where $\S_i$ are the sample covariance matrices of the two datasets, $\u_i$ are the first pair of canonical axes, and $m_x$ and $m_y$ are arbitrary numbers (both between $0$ and $1$) whose product $m_x m_y = \rho_1$ equals the first canonical correlation.

As you see, $\w_i$ are not directly equal to the CCA axes, but are given by some transformation of those. See Bach & Jordan for more details.
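One can verify this relationship at the population level: for the block covariance $\S$ above with $\S_{xy} = \w_x\w_y^\top$, the first canonical axis $\u_x$ (the leading eigenvector of $\S_x^{-1}\S_{xy}\S_y^{-1}\S_{yx}$) satisfies $\w_x \propto \S_x \u_x$, consistent with $\w_i = \S_i\u_i m_i$. A numerical sketch with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 4, 3

# Hypothetical PCCA parameters.
w_x = rng.standard_normal(p); w_y = rng.standard_normal(q)
A = rng.standard_normal((p, p)); Psi_x = A @ A.T + np.eye(p)
B = rng.standard_normal((q, q)); Psi_y = B @ B.T + np.eye(q)

# Blocks of the model covariance Sigma.
Sigma_x  = np.outer(w_x, w_x) + Psi_x
Sigma_y  = np.outer(w_y, w_y) + Psi_y
Sigma_xy = np.outer(w_x, w_y)

# Population CCA: u_x is the leading eigenvector of
#   Sigma_x^{-1} Sigma_xy Sigma_y^{-1} Sigma_yx  (a rank-1 matrix here).
M = np.linalg.inv(Sigma_x) @ Sigma_xy @ np.linalg.inv(Sigma_y) @ Sigma_xy.T
vals, vecs = np.linalg.eig(M)
u_x = np.real(vecs[:, np.argmax(np.real(vals))])

# Check that w_x is proportional to Sigma_x @ u_x (i.e. w_x = Sigma_x u_x m_x).
cand = Sigma_x @ u_x
cos = abs(cand @ w_x) / (np.linalg.norm(cand) * np.linalg.norm(w_x))
print(cos)   # ≈ 1
```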


I don't have a good intuitive grasp of PCCA. As you can see, the cross-covariance matrix between $X$ and $Y$ is modeled by $\w_x \w_y^\top$, so one might naively expect the $\w_i$ to yield PLS axes instead. The ML solution, however, is related to the CCA axes. This is presumably due to the block-diagonal structure of $\P=\begin{pmatrix}\P_x & \0\\ \0 & \P_y\end{pmatrix}$.

I am not aware of any similar probabilistic versions of RRR or PLS, and have failed to come up with any myself. Note that if $\P$ is diagonal then we obtain factor analysis (FA) on the combined $[\mathbf X, \mathbf Y]$ dataset, and if it is isotropic ($\P = \sigma^2 \I$) then we get PPCA on the combined dataset. So there is a progression from CCA to FA to PPCA, as $\P$ gets more and more constrained. I don't see what other choices of $\P$ would be reasonable.

amoeba
  • Just in short: what means "isotropic" in your second-last sentence? – Gottfried Helms Apr 15 '16 at 21:24
  • @Gottfried, it means that it is diagonal and all elements on the diagonal are equal, i.e. $\P = \sigma^2 \I$. I will edit to clarify. – amoeba Apr 15 '16 at 21:29
  • I see, thanks. I had implemented such a structure in my pca/factor-program not knowing its name, based on a hint of S. Mulaik in his 1972-book. Good to know... – Gottfried Helms Apr 16 '16 at 08:09