I am analyzing two data sets, X and Y, each with 19 samples (rows) and 26 variables (columns). I first computed the correlation between matching columns of X and Y, cor(X[,i], Y[,i]), and after correcting for multiple comparisons, some columns showed significant correlations. Now I would like to describe the relationship between X and Y using all columns together. Canonical correlation analysis (CCA) seems like a good choice, but because I have more columns than samples, I believe I need regularized canonical correlation.
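For concreteness, here is a minimal sketch of the column-wise step described above. The simulated matrices only stand in for the real X and Y, and the Benjamini-Hochberg correction and the 0.05 threshold are assumptions, since the correction method actually used is not stated.

```r
# Simulated stand-ins for X and Y (19 samples, 26 columns); not the real data.
set.seed(1)
n <- 19
p <- 26
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * p), n, p)

# Pearson correlation and its p-value for each matched column pair.
r <- sapply(seq_len(p), function(i) cor(X[, i], Y[, i]))
pval <- sapply(seq_len(p), function(i) cor.test(X[, i], Y[, i])$p.value)

# Adjust the 26 p-values for multiple comparisons
# (Benjamini-Hochberg is an assumed choice).
p_adj <- p.adjust(pval, method = "BH")

# Indices of column pairs that remain significant at the assumed 0.05 level.
significant <- which(p_adj < 0.05)
```

With the real data, `r[significant]` would list the column pairs that survive the correction.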

Following some online tutorials, I did this in two steps in R:

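library(CCA)

# Step 1: search a grid for the regularization parameters;
# the result contains lambda1, lambda2 and the CV score
regul <- estim.regul(X, Y, grid1 = seq(0.001, 1, length = 50), grid2 = seq(0.001, 1, length = 50), plt = TRUE)

# Step 2: regularized CCA with the selected parameters;
# the result contains xcoef, ycoef, scores, etc.
rcc_result <- rcc(X, Y, regul$lambda1, regul$lambda2)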

The correlation between the xscores and yscores of the first component is very high (r = 0.98). I understand that canonical analysis tries to maximize this value, and that the components are linear combinations of the columns of X and Y. But if there is just one pair of columns, X[,k] and Y[,k], with a fairly good correlation (r = 0.6), will this analysis always be able to improve on it by combining columns, so that the first components are highly correlated? If so, what does the result of this analysis mean?
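One way to judge whether a value like 0.98 means anything is to compare it with what the same procedure gives when there is no relationship between X and Y. The sketch below is a hand-rolled ridge-type CCA in base R, not the `CCA` package's implementation. The helper names (`inv_sqrt`, `first_score_cor`), the simulated noise data, and the fixed values of `lambda1` and `lambda2` are all assumptions made for illustration.

```r
# Inverse square root of a symmetric positive-definite matrix.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# Correlation between the first pair of canonical scores when lambda1 and
# lambda2 are added to the diagonals of the two covariance matrices.
first_score_cor <- function(X, Y, lambda1, lambda2) {
  X <- scale(X, scale = FALSE)
  Y <- scale(Y, scale = FALSE)
  Cxx_is <- inv_sqrt(cov(X) + lambda1 * diag(ncol(X)))
  Cyy_is <- inv_sqrt(cov(Y) + lambda2 * diag(ncol(Y)))
  s <- svd(Cxx_is %*% cov(X, Y) %*% Cyy_is)
  a <- Cxx_is %*% s$u[, 1]  # weights for the columns of X
  b <- Cyy_is %*% s$v[, 1]  # weights for the columns of Y
  cor(X %*% a, Y %*% b)[1, 1]
}

# Pure-noise stand-ins with the same shape as the real data.
set.seed(1)
n <- 19
p <- 26
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * p), n, p)
lambda1 <- 0.1  # assumed value, not taken from estim.regul
lambda2 <- 0.1  # assumed value, not taken from estim.regul

observed <- first_score_cor(X, Y, lambda1, lambda2)

# Permutation null: shuffling the rows of Y breaks any pairing with X.
null_cors <- replicate(200, first_score_cor(X, Y[sample(n), ], lambda1, lambda2))
p_perm <- mean(c(observed, null_cors) >= observed)
```

With the real X and Y in place of the noise matrices, an `observed` value that sits well inside the range of `null_cors` would suggest the high correlation comes from the flexibility of the method rather than from the data. A stricter version would also re-select the regularization parameters inside every permutation.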

  • It's not clear whether you are asking how to interpret regularized CCA as opposed to normal CCA, or how to interpret CCA at all. Can you answer your last two questions (the last two sentences) for the case of non-regularized CCA? – amoeba May 24 '16 at 16:24
  • Thanks for the comment, amoeba. You got me. I cannot answer the first question, which is the one that puzzles me most. My general understanding is that canonical correlation finds coefficients for two groups of vectors so as to maximize the correlation between the two linear combinations. I like this illustration very much: [link](http://stats.stackexchange.com/questions/65692/how-to-visualize-what-canonical-correlation-analysis-does-in-comparison-to-what) – yue May 24 '16 at 21:08
  • Normal canonical correlation seems reasonable to me, because with fewer columns (vectors) there is less room to "manipulate" the vectors to maximize the correlation, and the results then tell me how to weight the columns to get a good correlation. However, in my case, with more columns than samples, I notice that even with regularization the correlation of the canonical variates is very strong, r = 0.98. I am suspicious... I think it is because I have too many variables and too few samples. – yue May 24 '16 at 21:08
  • I think you are right to be suspicious. 0.98 looks very much like overfitting. I have no idea what `estim.regul` is doing, but I suspect that it's not doing its job well. In general, with 19 observations and 26 variables, I don't think you can hope to get anything useful from CCA/rCCA or related methods. You just don't have enough data. – amoeba May 24 '16 at 21:42
