I am analyzing two data sets (X, Y), each with 19 samples and 26 variables (columns). I am looking at the correlations between matching columns of X and Y, cor(X[,i], Y[,i]). After multiple-comparison correction, I found significant correlations in some columns. Now I am wondering how to represent the correlation between X and Y when all columns are considered together. I think canonical correlation may be a good choice. But in my case, since the number of columns is larger than the number of samples, I should use regularized canonical correlation.
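For reference, the per-column step can be sketched roughly as follows. This is only an illustration: the matrices here are simulated placeholders with the same dimensions as my data, and the Benjamini-Hochberg correction and the 0.05 cutoff are assumptions made for the example.

```r
set.seed(42)
n <- 19; p <- 26
X <- matrix(rnorm(n * p), n, p)  # placeholder data, not the real X
Y <- matrix(rnorm(n * p), n, p)  # placeholder data, not the real Y

# Pearson correlation and its p-value for each pair of matching columns
r_vals <- sapply(seq_len(p), function(i) cor(X[, i], Y[, i]))
p_vals <- sapply(seq_len(p), function(i) cor.test(X[, i], Y[, i])$p.value)

# Multiple-comparison correction (BH is an assumed choice for this sketch)
p_adj <- p.adjust(p_vals, method = "BH")

# Indices of columns that remain significant after correction
sig_cols <- which(p_adj < 0.05)
```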
I found some tutorials online. Two steps in R:
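library(CCA)
# Step 1: cross-validate the two regularization parameters over a grid
res <- estim.regul(X, Y, grid1 = seq(0.001, 1, length = 50), grid2 = seq(0.001, 1, length = 50), plt = TRUE)
# res holds lambda1, lambda2 and the CV score
# Step 2: fit the regularized CCA with the selected parameters
rcc_result <- rcc(X, Y, res$lambda1, res$lambda2)
# rcc_result holds xcoef, ycoef, scores, etc.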
I found that the correlation between xscore and yscore for the first component is very high (r = 0.98). I understand that canonical analysis tries to maximize this value, and that the components are linear combinations of the columns of X and Y. But if there is one pair of columns, X[,k] and Y[,k], with a fairly good correlation (r = 0.6), will this analysis always improve on it by finding combinations that make the first components highly correlated? If so, what is the meaning of this analysis?
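To make the question concrete, here is a minimal sketch of a check I have in mind: run the same kind of analysis on two matrices of independent noise with the same dimensions. The helper `first_rcc` is hand-written for illustration (a ridge-regularized CCA solved via Cholesky factors and an SVD), not the CCA package's code, and `lambda = 0.1` is an arbitrary value rather than a cross-validated one.

```r
set.seed(1)
n <- 19; p <- 26

# Hypothetical helper: first component of a ridge-regularized CCA.
# Returns the regularized canonical correlation and the plain sample
# correlation between the first pair of score vectors.
first_rcc <- function(X, Y, lambda1, lambda2) {
  X <- scale(X, scale = FALSE)
  Y <- scale(Y, scale = FALSE)
  Cxx <- var(X) + diag(lambda1, ncol(X))  # ridge term makes Cxx invertible
  Cyy <- var(Y) + diag(lambda2, ncol(Y))
  Cxy <- cov(X, Y)
  Rx <- chol(Cxx)                         # Cxx = t(Rx) %*% Rx
  Ry <- chol(Cyy)
  K  <- solve(t(Rx)) %*% Cxy %*% solve(Ry)
  sv <- svd(K)
  a  <- solve(Rx, sv$u[, 1])              # first X weight vector
  b  <- solve(Ry, sv$v[, 1])              # first Y weight vector
  c(regularized = sv$d[1],
    scores = abs(cor(X %*% a, Y %*% b)[1, 1]))
}

# Two independent noise matrices: no true association between X and Y
X0 <- matrix(rnorm(n * p), n, p)
Y0 <- matrix(rnorm(n * p), n, p)
null_fit <- first_rcc(X0, Y0, lambda1 = 0.1, lambda2 = 0.1)
print(null_fit)
```

If the score correlation is also high for pure noise, then a value like r = 0.98 on my data may say more about having 26 columns and only 19 samples than about a real relationship, and I would need some reference distribution (for example from permuting the rows of Y) to judge it.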