I am running a canonical correlation analysis between two multivariate datasets $X$ and $Y$. For each pair of canonical variates (x-y pair) I get a canonical correlation coefficient. How can I test its statistical significance?

ttnphns
user32256

1 Answer

Let $p_x$ and $p_y$ be the numbers of variables in your sets $X$ and $Y$, and let $N$ be the sample size. You have obtained $m=\min(p_x,p_y)$ canonical correlations $\gamma_1, \gamma_2, \ldots, \gamma_m$. Testing them usually goes as follows.

Given $\gamma_j$, its corresponding eigenvalue is $\lambda_j= \frac{1}{1-\gamma_j^2}-1$.

Wilks' lambda statistic for it is $w_j= \frac{1}{1+\lambda_j}w_{j+1}$. So, first compute $w_m$, which is $\frac{1}{1+\lambda_m}$, then compute $w_{m-1}$ from $w_m$, and so on, working backwards to $w_1$.
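As a cross-check on the recursion, note that the eigenvalue definition gives $\frac{1}{1+\lambda_j} = 1-\gamma_j^2$. Unrolling the recursion, starting from $w_m = \frac{1}{1+\lambda_m}$, then gives a closed form that uses only the canonical correlations:

$$w_j = \prod_{k=j}^{m} \frac{1}{1+\lambda_k} = \prod_{k=j}^{m} \left(1-\gamma_k^2\right)$$

So $w_1$ involves all $m$ correlations, while $w_m = 1-\gamma_m^2$ involves only the last one.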

This statistic has approximately a chi-square distribution (under the assumptions of normality and large $N$) with $df_j= (p_x-j+1)(p_y-j+1)$ degrees of freedom. To convert Wilks' lambda into the chi-square statistic: $\chi_j^2= -\ln(w_j)(N-(p_x+p_y+3)/2)$.

So, evaluate the chi-square cdf with $df_j$ degrees of freedom at $\chi_j^2$ and subtract the result from 1 to get the p-value for correlation $\gamma_j$.
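The steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function names `chi2_sf` and `cca_significance` are hypothetical, the canonical correlations are assumed to be already computed, sorted from largest to smallest, and strictly below 1 in absolute value, and the chi-square upper tail is computed with a standard-library recurrence that is valid only for integer degrees of freedom (a statistics library's chi-square survival function could be used instead).

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), started
    from Q(1, y) = exp(-y) (even df) or Q(1/2, y) = erfc(sqrt(y)) (odd df).
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def cca_significance(gammas, p_x, p_y, n):
    """Return one tuple (j, w_j, chi2_j, df_j, p_value_j) per correlation.

    gammas: canonical correlations gamma_1..gamma_m (assumed |gamma| < 1).
    p_x, p_y: numbers of variables in the two sets; n: sample size.
    """
    m = len(gammas)
    scale = n - (p_x + p_y + 3) / 2.0
    results = []
    w = 1.0  # running product, so the first pass yields w_m = 1 / (1 + lambda_m)
    for j in range(m, 0, -1):  # work backwards: j = m, m-1, ..., 1
        g = gammas[j - 1]
        lam = 1.0 / (1.0 - g * g) - 1.0  # eigenvalue lambda_j
        w = w / (1.0 + lam)              # w_j = w_{j+1} / (1 + lambda_j)
        chi2 = -math.log(w) * scale      # chi-square approximation
        df = (p_x - j + 1) * (p_y - j + 1)
        results.append((j, w, chi2, df, chi2_sf(chi2, df)))
    results.reverse()  # report in the order gamma_1, ..., gamma_m
    return results
```

For example, `cca_significance([0.8, 0.3], p_x=2, p_y=3, n=50)` (made-up illustrative values) returns two tuples, the first with $df_1 = 6$ and the second with $df_2 = 2$.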

What does this p-value actually mean? A nonsignificant p-value for $\gamma_1$ indicates that none of the canonical correlations $\gamma_1$ through $\gamma_m$ is significant (i.e. the hypothesis that they are all zero should not be rejected). A significant p-value for $\gamma_1$ together with a nonsignificant p-value for $\gamma_2$ indicates that $\gamma_1$ is significant (likely to be nonzero in the population), while the rest, $\gamma_2$ through $\gamma_m$, are all not significant; and so on. Sometimes the p-value for $\gamma_{j+1}$ is lower than the one for $\gamma_{j}$. That should not be read as "$\gamma_{j+1}$ is more significant", because a more junior correlation cannot be more significant than a more senior one. As already said, if $\gamma_{j}$ is not significant for you, all the remaining junior correlations must automatically be considered not significant too.
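The sequential reading of the p-values described above can be sketched in a few lines of Python. The function name `count_significant` and the default `alpha=0.05` are assumptions made for illustration; the p-values are assumed to be ordered from $\gamma_1$ to $\gamma_m$.

```python
def count_significant(p_values, alpha=0.05):
    """Count the leading canonical correlations judged significant.

    Stops at the first p-value that is not below alpha; every later
    (more junior) correlation is then treated as nonsignificant, even
    if its own p-value happens to be small.
    """
    k = 0
    for p in p_values:
        if p >= alpha:
            break
        k += 1
    return k


# The third p-value is small, but it follows a nonsignificant one,
# so only the first correlation counts as significant.
print(count_significant([0.001, 0.20, 0.01]))  # prints 1
```

The early `break` is what encodes the rule that a junior correlation cannot be declared significant once a more senior one has failed the test.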

For an algorithm of CCA, look here.

ttnphns
  • Your explanation makes the procedure for testing statistical significance clear. Say you have obtained $m$ canonical correlations (CC). You decide how many of them are statistically significant by comparing the associated p-values with a chosen significance threshold, say 0.001, and discarding every correlation with p > 0.001. My question is: what are some ways to select this threshold, and is multiple testing correction a necessary step in selecting it? I HIGHLY appreciate your answers! – Vendetta Nov 12 '18 at 03:44
  • @Vendetta, as the answer clearly says, if the j-th correlation was not significant, you consider it and all of the (j+1)-th, (j+2)-th, etc. nonsignificant. The alpha threshold is chosen as in any test; we usually select .05, .01, or .001. No multiple testing correction is needed. – ttnphns Nov 12 '18 at 09:20
  • Thank you for this explanation. One more thing: could you please share a reference on the statistical significance of CC, specifically in relation to the p-value? – Vendetta Nov 15 '18 at 03:34