I have X with shape (21, 15), i.e. 21 observations and 15 variables, and Y with shape (21, 6), i.e. 21 observations and 6 variables. When I run CCA on X and Y, I get canonical correlation coefficients of 1, but I know that shouldn't happen for my data. How can I explain this overfitting of CCA? If the total number of variables is smaller than the number of observations, CCA works fine. Why does this happen? Is there a mathematical proof?

BMErunner

1 Answer

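Yes, there's an interesting geometric interpretation that easily shows that if $n \le p + q$, some of the canonical correlations will become 1. In short, and using your definitions of $X$ and $Y$, this has to do with the row space of the data matrix $Z = [X,Y]^T$. After centering, the $p+q$ rows of $Z$ (one per variable) all lie in an $(n-1)$-dimensional subspace of $\mathbb{R}^n$, so they must be linearly dependent when $p+q > n-1$. If $X$ and $Y$ each have full column rank, the spans of the $X$ variables and of the $Y$ variables then share at least $p+q-(n-1)$ directions, and each shared direction gives a canonical correlation of exactly 1. For your data, $p+q = 21 > n-1 = 20$, so at least one canonical correlation equals 1.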

$n$: number of observations, and $p,q$: dimension of each set.
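
The dimension argument can be checked numerically. The following is a minimal sketch, not the code from the linked toy example: it assumes NumPy and independent random Gaussian data, and `canonical_correlations` is a hypothetical helper that computes the canonical correlations as the singular values of $Q_X^T Q_Y$, where $Q_X$ and $Q_Y$ are orthonormal bases for the column spaces of the centered $X$ and $Y$.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations of X (n x p) and Y (n x q), assuming full column rank."""
    # Center each variable; this confines the columns to an (n-1)-dim subspace.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered blocks.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # i.e. the canonical correlations, in descending order.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
p, q = 15, 6

# Same shapes as in the question: p + q = 21 > n - 1 = 20.
n_small = 21
rho_small = canonical_correlations(rng.standard_normal((n_small, p)),
                                   rng.standard_normal((n_small, q)))

# Many more observations than variables: no forced unit correlation.
n_large = 200
rho_large = canonical_correlations(rng.standard_normal((n_large, p)),
                                   rng.standard_normal((n_large, q)))

print("n = 21 :", np.round(rho_small, 4))
print("n = 200:", np.round(rho_large, 4))
```

Even though the two blocks are independent noise here, the first canonical correlation for $n = 21$ comes out as 1 up to floating-point error, while for $n = 200$ all of them stay well below 1.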

This is hard to visualize with your values of $n$, $p$, and $q$, but I've created a small toy example in this link that explains it, with code and figures here.

I've answered a similar question before here.

idnavid