
Given two vectors of random variables $X$ and $Y$, Canonical Correlation Analysis (CCA) finds the transformation matrices $A$ and $B$ so that $\operatorname{corr}(A_{1*} X, B_{1*} Y)$ is first maximal, $\operatorname{corr}(A_{2*} X, B_{2*} Y)$ is then maximal subject to $\operatorname{corr}(A_{1*} X, A_{2*} X) = 0$ and $\operatorname{corr}(B_{1*} Y, B_{2*} Y) = 0$, etc.

Is there a global objective function that $A$ and $B$ also optimize? For instance, do they maximize $\sum_i \operatorname{corr}(A_{i*} X, B_{i*} Y)$ subject to $A^TA=I$ and $B^TB=I$, or something along these lines?

Related to that, if we define a transformation matrix $W = B^{-1}A$, is there any relation between $WX$ and $Y$ for which $W$ is optimal? In particular, is it possible to establish some connection between this transformation $W$ and the optimization objective of Ordinary Least Squares (OLS)?

statotito

1 Answer


If $X$ is $n\times p$ and $Y$ is $n\times q$, then one can formulate the CCA optimization problem for the first canonical pair as follows:

$$\text{Maximize }\operatorname{corr}(Xa, Yb).$$

The value of the correlation does not depend on the lengths of $a$ and $b$, so they can be arbitrarily fixed. It is convenient to fix them such that the projections have unit variances:

$$\text{Maximize }\operatorname{corr}(Xa, Yb) \text{ subject to } a^\top \Sigma_X a=1 \text{ and } b^\top \Sigma_Yb=1,$$

because then the correlation equals the covariance:

$$\text{Maximize } a^\top \Sigma_{XY}b \text{ subject to } a^\top \Sigma_X a=1 \text{ and } b^\top \Sigma_Yb=1,$$

where $\Sigma_{XY}$ is the cross-covariance matrix given by $X^\top Y/n$.
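
(For concreteness, here is a minimal sketch of how these covariance blocks can be estimated from data, assuming the columns of $X$ and $Y$ have already been mean-centered; the function name is my own.)

```python
import numpy as np

def covariance_blocks(X, Y):
    """Sigma_X, Sigma_Y and the cross-covariance Sigma_XY for
    mean-centered data matrices X (n x p) and Y (n x q)."""
    n = X.shape[0]
    return X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
```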


We can now generalize it to more than one dimension as follows:

$$\text{Maximize }\operatorname{tr}(A^\top \Sigma_{XY}B) \text{ subject to } A^\top \Sigma_X A=I \text{ and } B^\top \Sigma_Y B=I,$$

where the trace is precisely the sum of the successive canonical correlation coefficients, as you hypothesized in your question. You only had the constraints on $A$ and $B$ wrong: they should be $A^\top \Sigma_X A=I$ and $B^\top \Sigma_Y B=I$ rather than plain orthogonality.

The standard way to solve the CCA problem is to define the substitutions $\tilde A = \Sigma_X^{1/2} A$ and $\tilde B = \Sigma_Y^{1/2} B$ (conceptually this is equivalent to whitening both $X$ and $Y$), obtaining

$$\text{Maximize }\operatorname{tr}(\tilde A^\top \Sigma_X^{-1/2} \Sigma_{XY}\Sigma_Y^{-1/2} \tilde B) \text{ subject to } \tilde A^\top \tilde A=I \text{ and } \tilde B^\top \tilde B=I.$$

This is now easy to solve because of the orthogonality constraints; the solution is given by the left and right singular vectors of $\Sigma_X^{-1/2} \Sigma_{XY}\Sigma_Y^{-1/2}$ (which can then easily be back-transformed to $A$ and $B$ without the tildes).
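
For illustration, here is a minimal NumPy sketch of this whitening-plus-SVD recipe (the function names, toy data and `inv_sqrtm` helper are my own inventions, not a standard API); it also verifies numerically that the trace objective above equals the sum of the canonical correlations:

```python
import numpy as np

def inv_sqrtm(S):
    """Inverse symmetric square root of a covariance matrix (assumed positive definite)."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def cca(X, Y, k):
    """CCA of mean-centered X (n x p) and Y (n x q) via SVD of the whitened cross-covariance."""
    n = X.shape[0]
    Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Sx_i, Sy_i = inv_sqrtm(Sx), inv_sqrtm(Sy)
    U, rho, Vt = np.linalg.svd(Sx_i @ Sxy @ Sy_i)
    A = Sx_i @ U[:, :k]        # back-transform: A = Sigma_X^{-1/2} @ tilde A
    B = Sy_i @ Vt[:k].T        # back-transform: B = Sigma_Y^{-1/2} @ tilde B
    return A, B, rho[:k]

# toy data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)); X -= X.mean(axis=0)
Y = X @ rng.standard_normal((4, 3)) + rng.standard_normal((500, 3)); Y -= Y.mean(axis=0)

A, B, rho = cca(X, Y, k=3)
Sxy = X.T @ Y / len(X)
print(np.allclose(np.trace(A.T @ Sxy @ B), rho.sum()))   # trace objective = sum of canonical corrs
```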


Relationship to reduced-rank regression

CCA can be formulated as a reduced-rank regression problem. Namely, $A$ and $B$ corresponding to the first $k$ canonical pairs minimize the following cost function:

$$\Big\|(Y-XAB^\top)\Sigma_Y^{-1/2}\Big\|^2 = \Big\|Y\Sigma_Y^{-1/2}-XAB^\top\Sigma_Y^{-1/2}\Big\|^2.$$

See e.g. de la Torre, 2009, A Least-Squares Framework for Component Analysis, page 6 (but the text is quite dense and might be a bit hard to follow). This is called reduced-rank regression because the matrix of regression coefficients $AB^\top\Sigma_Y^{-1/2}$ has low rank $k$.
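
For reference, here is a generic NumPy sketch of rank-$k$ least-squares (reduced-rank) regression, using the standard construction of projecting the unconstrained OLS fit onto the top-$k$ principal directions of the fitted values; to connect it to the cost above one would pass in the whitened responses $Y\Sigma_Y^{-1/2}$. The function and variable names are my own.

```python
import numpy as np

def reduced_rank_regression(X, Y, k):
    """Rank-k least-squares fit of Y on X: take the full OLS fit and
    project it onto the top-k principal directions of the fitted values."""
    V_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)     # unconstrained OLS coefficients
    _, _, Wt = np.linalg.svd(X @ V_ols, full_matrices=False)
    Wk = Wt[:k].T                                     # top-k right singular vectors of the fit
    return V_ols @ Wk @ Wk.T                          # rank-k coefficient matrix
```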

In contrast, standard OLS regression minimizes

$$\|Y-XV\|^2$$

without any rank constraint on $V$. The solution $V_\mathrm{OLS}$ will generally be full rank, i.e. rank $\min(p,q)$.
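
As a quick self-contained illustration (the toy data are invented for this example), the unconstrained OLS coefficient matrix can be obtained with a least-squares solve and is indeed full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))                                     # n x p predictors
Y = X @ rng.standard_normal((4, 3)) + rng.standard_normal((500, 3))   # n x q responses

V_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)        # (X'X)^{-1} X'Y
print(V_ols.shape, np.linalg.matrix_rank(V_ols))     # (4, 3), rank 3 = min(p, q)
```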

Even in the $k=p=q$ situation there remains one crucial difference: for CCA one needs to whiten the dependent variables $Y$ by replacing them with $Y\Sigma_Y^{-1/2}$. This is because regression tries to explain as much variance in $Y$ as possible, whereas CCA does not care about the variance at all; it only cares about the correlation. If $Y$ is whitened, then its variance is the same in all directions, and the regression loss function effectively starts maximizing the correlation.
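
To see the effect of the whitening step concretely, here is a small toy sketch (data and names invented for illustration): after replacing $Y$ with $Y\Sigma_Y^{-1/2}$, every direction of $Y$ has unit variance, so the squared-error loss can no longer favor high-variance directions of $Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((500, 3)) @ np.diag([10.0, 1.0, 0.1])  # very unequal variances
Y -= Y.mean(axis=0)

# Sigma_Y^{-1/2} via the eigendecomposition of the covariance matrix
w, V = np.linalg.eigh(Y.T @ Y / len(Y))
Yw = Y @ (V @ np.diag(w ** -0.5) @ V.T)

print(np.round(Yw.T @ Yw / len(Y), 3))   # ~ identity: unit variance in every direction
```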

(I think there is no way to obtain $A$ and $B$ from $V_\mathrm{OLS}$.)

amoeba
  • Shouldn't it be $\operatorname{tr}(\tilde A^\top \Sigma_X^{-1/2} \Sigma_{XY}\Sigma_Y^{-1/2} \tilde B)$? Also, I would be very grateful if you could give me some clue about the second (and main) question. – statotito Mar 18 '16 at 21:06
  • @statotito Yes, of course there should be minuses, thanks! I fixed it. Regarding your second question: note that $B$ is a rectangular matrix of e.g. $q \times 2$ size in case one wants to extract $2$ canonical pairs. As such, it cannot be inverted and your $W=B^{-1}A$ does not really make sense. Do you have some context for your second question? I know several formulations of CCA that make it appear a bit similar to a regression problem, but none of them uses $B^{-1}$ (which is undefined). – amoeba Mar 18 '16 at 21:28
  • I am only interested in the case where both $X$ and $Y$ are $n \times p$ and we extract $p$ canonical pairs. In this particular case, $B$ would be a $p \times p$ square matrix, so $B^{-1}$ should be defined and I think that $W = B^{-1}A$ could make some sense (but I might be wrong). I am mainly interested in finding some connection between an $W$ that relates these $A$ and $B$ from CCA (be it defined as $W=B^{-1}A$ or some other way) and the optimization objective of ordinary least squares, possibly with some constraints. – statotito Mar 18 '16 at 21:46
  • Please see my update. – amoeba Mar 19 '16 at 00:01
  • My sense is CCA is quite closely related to RRR but the objective is different. CCA is more of an errors-in-variables method, much in the way TLS (total least squares or orthogonal least squares) is. RRR, the basis for Johansen's method, has a likelihood (conditional on rank) and one is minimizing the SSE for all the equations given cross-equation restrictions. CCA on the other hand, can be thought of as a regression where the independent variables are each observed with error. Each of these methods is a form of regularisation (if you actually reduce the rank). – NBF Aug 01 '18 at 12:05
  • Hi @amoeba, in your answer at https://stats.stackexchange.com/questions/179733/theory-behind-partial-least-squares-regression, you mentioned that OLS maximizes the correlation between X and Y. But here, correct me if I am wrong, it seems that OLS is trying to maximize covariance, since you mention that CCA is the one that does not depend on the variance of Y. So there seems to be a conflict, and I could not think of a way to cast OLS into a covariance/correlation form. Is there a direct relation, or am I missing something? Thanks! – Vickyyy Jun 15 '19 at 18:26
  • @Vickyyy In the answer you link I am talking about univariate Y (usually denoted as lowercase y). In this case, OLS indeed maximizes the correlation between X and y, but this is equivalent to maximizing the "explained variance", which is the squared correlation times the variance of y. For multivariate Y, these are two different things: reduced-rank multivariate regression maximizes explained variance, whereas CCA maximizes correlation. – amoeba Jun 15 '19 at 21:16
  • What sort of textbooks is the last bit found in? – Trajan Aug 25 '20 at 20:29