Statistical significance in canonical correlation analysis

Question

I am running a canonical correlation analysis between two multivariate datasets $X$ and $Y$. For each pair of canonical variates (an x-y pair) I get a canonical correlation coefficient. How can I test its statistical significance?

score 9 · Answer 1 · edited Apr 13 '17 at 12:44

Let $p_x$ and $p_y$ be the numbers of variables in your sets $X$ and $Y$, and let $N$ be the sample size. You have obtained $m=\min(p_x,p_y)$ canonical correlations $\gamma_1, \gamma_2, \dots, \gamma_m$. Testing them usually goes as follows.

Given $\gamma_j$, its corresponding eigenvalue is $\lambda_j= \frac{1}{1-\gamma_j^2}-1$.

Wilks' lambda statistic for it is $w_j= \frac{1}{1+\lambda_j}w_{j+1}$, with $w_{m+1}=1$. So, first compute $w_m$, which is $\frac{1}{1+\lambda_m}$, then compute $w_{m-1}$ from $w_m$, and so on, backwards.
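As a quick check of the recursion, note that $\frac{1}{1+\lambda_j} = 1-\gamma_j^2$, so the backward steps unroll into a product. The numbers below ($\gamma_1 = 0.8$, $\gamma_2 = 0.3$, $m = 2$) are made up purely for illustration:

$$w_j = \prod_{k=j}^{m} \left(1-\gamma_k^2\right), \qquad w_2 = 1 - 0.3^2 = 0.91, \qquad w_1 = (1 - 0.8^2)\, w_2 = 0.36 \times 0.91 = 0.3276.$$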

A transformation of this statistic has approximately a chi-square distribution (under the assumptions of normality and large $N$) with $df_j= (p_x-j+1)(p_y-j+1)$ degrees of freedom. To convert Wilks' lambda into chi-square (Bartlett's approximation): $\chi_j^2= -\ln(w_j)(N-(p_x+p_y+3)/2)$.

So, evaluate the chi-square cdf with $df_j$ degrees of freedom at $\chi_j^2$, subtract the result from 1, and you have the p-value for correlation $\gamma_j$.
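The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names `chi2_sf` and `cca_wilks_test` and the input numbers are hypothetical, and the canonical correlations are assumed to be already computed, sorted in decreasing order, and strictly below 1. The upper-tail chi-square probability is written out with the standard library (valid because $df_j$ is always an integer here); if SciPy is available, `scipy.stats.chi2.sf` does the same job.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting value for df = 1 (odd) or df = 2 (even).
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(z)), 0.5
    else:
        q, a = math.exp(-z), 1.0
    # Incomplete-gamma recurrence Q(a+1, z) = Q(a, z) + z^a e^(-z) / Gamma(a+1),
    # which raises df by 2 on each pass.
    while 2 * a < df:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def cca_wilks_test(gammas, n, p_x, p_y):
    """Return (gamma_j, w_j, chi2_j, df_j, p_j) for each canonical correlation.

    gammas: canonical correlations in decreasing order, each assumed < 1.
    n: sample size; p_x, p_y: numbers of variables in the two sets.
    """
    m = len(gammas)
    # Wilks' lambda, computed backwards: w_m first, then w_{m-1}, and so on.
    wilks = [0.0] * m
    w = 1.0
    for j in range(m - 1, -1, -1):
        lam = 1.0 / (1.0 - gammas[j] ** 2) - 1.0  # eigenvalue lambda_j
        w *= 1.0 / (1.0 + lam)
        wilks[j] = w
    scale = n - (p_x + p_y + 3) / 2.0
    results = []
    for j in range(m):  # j is 0-based here; the text's j equals j + 1
        df = (p_x - j) * (p_y - j)
        chi2 = -math.log(wilks[j]) * scale
        results.append((gammas[j], wilks[j], chi2, df, chi2_sf(chi2, df)))
    return results


# Made-up example: 2 and 3 variables, N = 100, two canonical correlations.
for g, w, chi2, df, p in cca_wilks_test([0.8, 0.3], n=100, p_x=2, p_y=3):
    print(f"gamma={g:.2f}  wilks={w:.4f}  chi2={chi2:.2f}  df={df}  p={p:.4g}")
```

Each returned row corresponds to one $\gamma_j$, so the p-values can be read off in order from the most senior correlation to the most junior.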

What does this p-value actually mean? A nonsignificant p-value for $\gamma_1$ tells you that none of the canonical correlations $\gamma_1$ through $\gamma_m$ is significant (i.e. the hypothesis that they are all zero should not be rejected). A significant p-value for $\gamma_1$ together with a nonsignificant p-value for $\gamma_2$ tells you that $\gamma_1$ is significant (likely to be nonzero in the population), while the rest, $\gamma_2$ through $\gamma_m$, are all not significant; and so on. Sometimes the p-value for $\gamma_{j+1}$ is lower than that for $\gamma_{j}$. That should not be read as "$\gamma_{j+1}$ is more significant", because a more junior correlation cannot be more significant than a more senior one. As said already, if $\gamma_{j}$ is not significant for you, all the remaining junior correlations must automatically be considered not significant too.
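The sequential reading of the p-values can be written down as a tiny helper. This is a sketch only: the name `count_significant` and the default `alpha=0.05` are illustrative choices, and the p-values are assumed to be ordered from $\gamma_1$ to $\gamma_m$.

```python
def count_significant(p_values, alpha=0.05):
    """Count leading canonical correlations judged significant.

    Stops at the first p-value that is not below alpha; every later
    (more junior) correlation is treated as nonsignificant, even if
    its own p-value happens to be small.
    """
    k = 0
    for p in p_values:
        if p >= alpha:
            break
        k += 1
    return k


# Made-up p-values: the third one is small, but it comes after a
# nonsignificant one, so only the first correlation is retained.
print(count_significant([0.001, 0.20, 0.01]))
```

Stopping at the first nonsignificant p-value is what enforces the rule that a junior correlation cannot be more significant than a senior one.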

For an algorithm of CCA, look here.

The procedure for testing statistical significance is clear from your explanation. Say you have obtained $m$ canonical correlations (CC). You decide how many of the $m$ are statistically significant by comparing the associated p-values with a significance threshold, say p = 0.001; all CCs with p > 0.001 are discarded. My question is: what are some ways to select the p-value threshold, and is a multiple-testing correction a necessary step in selecting it? I HIGHLY appreciate your answers! — Vendetta, Nov 12 '18 at 03:44
@Vendetta, as the answer clearly says, if the j-th correlation is not significant, you consider it and all subsequent ones (j+1, j+2, etc.) nonsignificant. The alpha threshold is chosen as in any test; we usually select .05 or .01 or .001. No multiple testing correction is needed. — ttnphns, Nov 12 '18 at 09:20
Thank you for this explanation. One more thing: could you please share a reference about the statistical significance of CCs, specifically in relation to the p-value? — Vendetta, Nov 15 '18 at 03:34
