I have X with shape (21, 15), i.e. 21 observations and 15 variables, and Y with shape (21, 6), i.e. 21 observations and 6 variables. When I run CCA on X and Y, I get correlation coefficients of 1, but I know that shouldn't happen for my data. How can I explain this overfitting of CCA? If the total number of variables is less than the number of observations, CCA works fine. Why does this happen? Is there a mathematical proof?
1 Answer
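Yes, there's an interesting geometric interpretation that easily shows that if $n \le p + q$, some of the canonical correlations will equal 1. In short, and using your definitions of $X$ and $Y$, this has to do with the row space of the data matrix $Z = [X,Y]^T$: after centering, its $p+q$ rows lie in a space of dimension at most $n-1$, so when $p+q > n - 1$ they are linearly dependent, and (generically) some linear combination of the columns of $X$ coincides exactly with some linear combination of the columns of $Y$.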
Here $n$ is the number of observations, and $p, q$ are the numbers of variables in each set. In your case $p + q = 15 + 6 = 21 = n$, so the condition holds.
This is hard to visualize with your values for $n,p,q$, but I've created a small toy example in this link that explains this, with code and figures here.
I've answered a similar question before here.
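
As a quick numerical check of the argument above, here is a minimal sketch (not the toy example linked above) that computes canonical correlations for pure-noise data using NumPy. The helper name `canonical_correlations`, the random seed, and the comparison sample size of 200 are my own assumptions for illustration. It uses the standard fact that canonical correlations are the cosines of the principal angles between the column spaces of the centered $X$ and $Y$.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations as cosines of principal angles
    between the column spaces of the centered data matrices."""
    Xc = X - X.mean(axis=0)          # center each variable
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)         # orthonormal basis for col(Xc)
    Qy, _ = np.linalg.qr(Yc)         # orthonormal basis for col(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)      # guard against round-off above 1

rng = np.random.default_rng(0)       # arbitrary seed (assumption)
p, q = 15, 6

# n = 21 = p + q: independent noise, yet the top correlation is 1
small = canonical_correlations(rng.standard_normal((21, p)),
                               rng.standard_normal((21, q)))

# n = 200 > p + q: same kind of noise, top correlation well below 1
large = canonical_correlations(rng.standard_normal((200, p)),
                               rng.standard_normal((200, q)))

print("n = 21 :", small.max())
print("n = 200:", large.max())
```

Even though $X$ and $Y$ are independent noise here, the largest canonical correlation for $n = 21$ comes out as 1 up to floating-point error, while for $n = 200$ it stays clearly below 1.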

idnavid