
For my current project I am using sklearn.cross_decomposition.CCA. On several webpages (e.g. https://stats.idre.ucla.edu/r/dae/canonical-correlation-analysis/ or https://www.uaq.mx/statsoft/stcanan.html or https://pure.uvt.nl/ws/portalfiles/portal/596531/useofcaa_ab5.pdf) it says that canonical loadings can be computed as the correlations between the variables and the canonical variates. However, I could not reproduce this using scipy.stats.pearsonr. Is this simply false information, or am I doing something wrong?

Here's an example:

from sklearn.cross_decomposition import CCA
import numpy as np
from scipy.stats import pearsonr

# compute CCA
X = np.array([[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [3.,5.,4.]])
Y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
cca = CCA(n_components=1)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# obtain loadings for X variable set
x_loadings = cca.x_loadings_

print(f"The first variable has a loading of {x_loadings[0]} on the first canonical variate")

# try to manually calculate the loadings using Pearson correlation
r, _ = pearsonr(X[:, 0], np.squeeze(X_c))

print(f"Correlation between first variable and first canonical variate: {r}")

which gives you:

The first variable has a loading of [0.61454275] on the first canonical variate
Correlation between first variable and first canonical variate: 0.9610851804703184

As you can see, those numbers are totally different...

  • Computations of CCA, including the loadings, can be found in https://stats.stackexchange.com/a/77309/3277. – ttnphns Nov 08 '21 at 18:22

2 Answers


CCA has inconsistent nomenclature; there are a few different things I have seen called loadings:

  1. variable weights or parameters that multiply your data to create the canonical variates, analogous to beta-hat in linear regression or to the weights in machine-learning algorithms. This is probably what scikit-learn's cca.x_loadings_ is; you can check whether X * x_loadings_ reproduces X-hat to confirm this, or read the docs.
  2. Pearson correlations between your X variables and the X canonical variates (and vice versa for Y). IMO this is the correct usage, but don't be surprised to see others.
  3. Pearson correlations between your X variables and the Y canonical variates (or vice versa), also called cross-loadings.
  4. the canonical variates themselves

Another possibility is that you simply calculated it incorrectly, e.g. by using columns instead of rows, or the first row instead of the last. I am not saying you did, but it is easy to make mistakes like that in problems of this kind.

rep_ho

I found the solution. Indeed, user rep_ho was right: I did something wrong by not standardizing the X matrix before doing the calculations. Running the following before calculating the correlation solved the problem:

from scipy.stats import zscore
X = zscore(X,ddof=1)

Additional notes (not relevant for this thread):

Further referring to the first point of rep_ho's answer, X_c can be reproduced by multiplying the standardized feature matrix with cca.x_weights_:

X_c = np.matmul(X, cca.x_weights_)

That means that cca.x_weights_ is what rep_ho refers to as "variable weights or parameters".