What is the distribution of $\mathrm{tr}(AA'BB')$ where $A$ and $B$ are two random $d \times k$ matrices with orthonormal columns?
Maybe the expected value is easier to compute? A fallback solution would be to use a simulation. What would be the most effective scheme? Typical values for $d$ would be around 2000, while $k$ ranges from ~10 to a few hundred.
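For the simulation fallback, here is a minimal Monte Carlo sketch, assuming a "random subspace" is taken to be the column span of a QR-orthonormalised Gaussian matrix (which, if I am not mistaken, is uniform in the Haar sense). It relies on the identity $\mathrm{tr}(AA'BB') = \|A'B\|_F^2$, so only a $k \times k$ matrix is ever formed, which stays cheap even for $d \approx 2000$. As for the expectation: by rotational symmetry $E[BB'] = (k/d) I_d$ for a uniformly random subspace, which would give $E[\mathrm{tr}(AA'BB')] = k^2/d$, i.e. $k/d$ for the normalised criterion; the simulation can at least check that.

```python
import numpy as np

def random_orthonormal(d, k, rng):
    """Orthonormal basis of a (presumably Haar-uniform) k-dim subspace of R^d,
    obtained as the Q factor of a d x k standard Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

def simulate_criterion(d, k, n_rep=1000, seed=0):
    """Monte Carlo sample of tr(AA'BB')/k for two independent random subspaces."""
    rng = np.random.default_rng(seed)
    vals = np.empty(n_rep)
    for i in range(n_rep):
        A = random_orthonormal(d, k, rng)
        B = random_orthonormal(d, k, rng)
        # tr(AA'BB') = ||A'B||_F^2, computed from the k x k matrix A'B only
        vals[i] = np.sum((A.T @ B) ** 2) / k
    return vals

vals = simulate_criterion(d=2000, k=50)
print(vals.mean(), vals.std())  # mean should come out near k/d = 0.025
```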
Below is a more detailed account of my problem and its context, how I ended up asking this question, and what I tried.
Context
I want to check whether the principal components computed from a sample of a stochastic process have converged. My current ideas involve comparing the subspaces spanned by the first $k$ principal components, for given values of interest of $k$, either across several realisations of the stochastic process or across bootstrapped principal components. My criterion for subspace similarity is $\mathrm{tr}(AA'BB') / k$, where $A$ and $B$ are matrices whose $k$ columns are bases of the two subspaces to compare. This criterion is easy to compute and behaves well, except for the following property: as the dimension of the subspaces approaches the total dimension, the remaining angular space in which to point "wrong" directions shrinks. In order to build a more meaningful criterion, I thought of comparing this score to the score obtained by comparing two random subspaces of dimension $k$.
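For reference, I believe this criterion can be read as the mean squared cosine of the principal angles between the two subspaces (in the sense of Björck & Golub, cited below): writing $\sigma_i$ for the singular values of $A'B$ and $\theta_i$ for the principal angles,

$$\mathrm{tr}(AA'BB') = \mathrm{tr}\big((A'B)(A'B)'\big) = \|A'B\|_F^2 = \sum_{i=1}^k \sigma_i^2 = \sum_{i=1}^k \cos^2\theta_i,$$

so the normalised score lies in $[0, 1]$, equal to $1$ for identical subspaces and $0$ for orthogonal ones.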
My attempt
My first attempt was to consider that, without loss of generality, the first random subspace could have as its basis the first $k$ vectors of the canonical basis.
A basis for the other random subspace can then be built by picking $k$ vectors from the canonical basis without replacement.
The resulting distribution is then simply that of a hypergeometric law corresponding to $k$ draws from a total pool of $d$ vectors, among which $k$ give a positive outcome (the first $k$ vectors of the canonical basis), where $d$ is the dimension of the total space: $\mathrm{H}(d, k, k/d)$.
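A quick check of that combinatorial scheme by simulation (a sketch only, assuming scipy's parametrisation of the hypergeometric law with population size `M`, `n` marked items and `N` draws):

```python
import numpy as np
from scipy import stats

d, k, n_rep = 2000, 50, 20000
rng = np.random.default_rng(1)
# With A = first k canonical vectors and B = k canonical vectors drawn without
# replacement, tr(AA'BB') is just the number of drawn indices that fall below k.
overlaps = np.array([np.sum(rng.choice(d, size=k, replace=False) < k)
                     for _ in range(n_rep)])
hyp = stats.hypergeom(M=d, n=k, N=k)  # pool of d, k "good" vectors, k draws
print(overlaps.mean(), hyp.mean())    # both should be near k*k/d = 1.25
```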
Now, there is no reason that the vectors of the two bases should be either aligned or orthogonal. I suppose it is possible to remedy this by applying a random rotation $R$ and looking at $\mathrm{tr}(AA'RBB'R')$. I am not sure how a rotation in $\mathbb{R}^d$ behaves, but maybe using properties of the trace and the fact that $R' = R^{-1}$ it is possible to sort this out?
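Here is a rough sketch of that idea, assuming `scipy.stats.ortho_group` indeed samples orthogonal matrices from the Haar measure (its documentation says so). By cyclicity of the trace, $\mathrm{tr}(AA'RBB'R') = \|A'RB\|_F^2$, and since $RB$ should then span a uniformly random subspace, this ought to reproduce the Monte Carlo distribution sketched above:

```python
import numpy as np
from scipy.stats import ortho_group

d, k = 200, 10                  # kept small here: R is a dense d x d matrix
A = np.eye(d, k)                # first k canonical vectors
B = np.eye(d, k)                # a fixed subspace, to be rotated by R
vals = []
for _ in range(500):
    R = ortho_group.rvs(d)                       # Haar-distributed orthogonal matrix
    vals.append(np.sum((A.T @ R @ B) ** 2) / k)  # tr(AA' R BB' R') / k
print(np.mean(vals))            # should come out near k/d = 0.05
```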
Note: Random orthogonal projectors are distributed according to the Wishart distribution. I do not know more about this however.
Related references:
Absil, Edelman & Koev, “On the largest principal angle between random subspaces”, Linear Algebra and its Applications, 2006, doi:10.1016/j.laa.2005.10.004, (I did not read this one)
Björck & Golub, “Numerical Methods for Computing Angles Between Linear Subspaces”, Mathematics of Computation, 1973
Ipsen & Meyer, “The Angle Between Complementary Subspaces”, The American Mathematical Monthly, 1995
Johnstone, “Multivariate analysis and Jacobi ensembles: Largest eigenvalue, Tracy-Widom limits and rates of convergence”, Ann. Statist., 2008 (too esoteric for me)
Liquet & Saracco, “Application of the Bootstrap Approach to the Choice of Dimension and the $\alpha$ Parameter in the {SIR} $\alpha$ Method”, Communications in Statistics Simulation and Computation, 2008
Wang, Wang & Feng, “Subspace distance analysis with application to adaptive Bayesian algorithm for face recognition”, Pattern Recognition , 2006
Zuccon, Azzopardi & van Rijsbergen, “Semantic Spaces: Measuring the Distance between Different Subspaces”, 2009
- Hotelling, “Relations Between Two Sets of Variates”, Biometrika, 1936 (Possibly appropriate tests are given in paragraphs 11 and following in low dimension cases and extended in paragraph 15 to higher dimensions).
- Bao, Hu, Pan & Zhou, “Test of independence for high-dimensional random vectors based on freeness in block correlation matrices”, Electronic Journal of Statistics, 2017