I'm trying to implement Canonical Correlation Analysis (CCA) using the Eigen header-only C++ template library, to deepen my understanding of both the library and the statistical technique itself.
So far, the best reference I've come across for how this is implemented in practice is the documentation of the CCA R package (link). As I understand it, the steps presented there are as follows:
- Take input sample matrices $X, Y$, each with $n$ observations as rows (I'm not sure how the case of differing row counts is handled), where the numbers of columns, $p$ and $q$, are the variables.
- Compute $S_{xx} := X^tX$ and $S_{yy} := Y^tY$
- Compute $P_x := \frac{1}{n}X(S_{xx})^{-1}X^t$ and $P_y := \frac{1}{n}Y(S_{yy})^{-1}Y^t$
- Compute the eigendecomposition of $P_xP_y$: the eigenvalues are the squared canonical correlations and the eigenvectors are the canonical variables $U$; then repeat for $P_yP_x$ to get the other set of canonical variables, $V$.
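For concreteness, here's a minimal Eigen sketch of my reading of the steps above (entirely my own code, not taken from the package; it assumes $X$ and $Y$ are already column-centered and that $S_{xx}, S_{yy}$ are well-conditioned):

```cpp
#include <Eigen/Dense>
#include <iostream>

// Sketch of the unregularized steps above; `cca` and all names are mine.
void cca(const Eigen::MatrixXd& X, const Eigen::MatrixXd& Y) {
    const double n = static_cast<double>(X.rows());

    // S_xx := X^t X and S_yy := Y^t Y
    const Eigen::MatrixXd Sxx = X.transpose() * X;
    const Eigen::MatrixXd Syy = Y.transpose() * Y;

    // P_x := (1/n) X S_xx^{-1} X^t, via a solve rather than an explicit inverse
    const Eigen::MatrixXd Px = X * Sxx.ldlt().solve(X.transpose()) / n;
    const Eigen::MatrixXd Py = Y * Syy.ldlt().solve(Y.transpose()) / n;

    // P_x P_y is not symmetric in general, so use the general solver;
    // eigenvalues/eigenvectors come back complex, take the real parts.
    // (Eigen does not sort these, so I'd still sort by eigenvalue myself.)
    Eigen::EigenSolver<Eigen::MatrixXd> es(Px * Py);
    const Eigen::VectorXd rho2 = es.eigenvalues().real();  // squared canonical correlations
    const Eigen::MatrixXd U    = es.eigenvectors().real(); // canonical variables for X

    std::cout << "squared canonical correlations:\n" << rho2 << "\n";
}
```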
So far so good. I know the Eigen library has eigensolvers, but is there a way to avoid doing two separate eigendecompositions in step 4? Once I've decomposed $P_xP_y$, is there a more efficient way of getting the eigenvectors of $P_yP_x$?
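If it helps, the identity I suspect is relevant here (though I haven't verified it's the numerically sensible route) is that if $u$ is an eigenvector of $P_xP_y$ with eigenvalue $\rho^2$, then

$$P_yP_x(P_yu) = P_y(P_xP_yu) = \rho^2(P_yu),$$

so $P_yu$ should be an eigenvector of $P_yP_x$ with the same eigenvalue, up to normalization.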
The paper also mentions what seems to be an extremely computationally intensive regularization approach, where we grid over values of two parameters $\lambda_1, \lambda_2$ to regularize possibly ill-conditioned $S_{xx}, S_{yy}$ matrices. It then does leave-one-out cross-validation, which effectively repeats step 4 $n$ times per grid point. If the grid is $100 \times 100$, that's $10{,}000 \times n$ total repetitions of step 4, which seems insane. Am I interpreting that correctly? Are there alternatives whose computation doesn't explode like this relative to the unregularized variant?
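For reference, here's how I'd implement a single grid point, assuming (my reading of the paper, not something it states in code) that the regularization just means replacing the cross-product matrices with ridge-shifted versions before inverting:

```cpp
#include <Eigen/Dense>

// Hypothetical: ridge-shifted cross-product matrix for one grid point;
// `lambda` is the candidate regularization parameter for this grid cell.
Eigen::MatrixXd regularizedCrossProduct(const Eigen::MatrixXd& M, double lambda) {
    return M.transpose() * M
         + lambda * Eigen::MatrixXd::Identity(M.cols(), M.cols());
}
// Then proceed exactly as in the unregularized steps, with
// regularizedCrossProduct(X, lambda1) and regularizedCrossProduct(Y, lambda2)
// in place of S_xx and S_yy, once per left-out observation.
```

Each individual grid point is cheap; it's the $10{,}000 \times n$ repetitions of the whole decomposition that worry me.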