
Given a matrix $X\in\mathbb{R}^{m\times n}$, I am trying to maximize $\text{Tr}(D^TX^TXD)$ over $D\in\mathbb{R}^{n\times l}$ ($l\le n$) subject to $D^TD=I_l$, where $\text{Tr}$ denotes the trace and $I_l$ denotes the identity matrix of size $l$.

More specifically, I am trying to find $$D^{*}=\underset{D}{\arg\max}\ \text{Tr}(D^TX^TXD)\quad\text{subject to } D^TD=I_l.$$

The solution is that $D^*$ is given by the $l$ eigenvectors of $X^TX$ corresponding to its $l$ largest eigenvalues. However, I can't prove this.

I noticed that $D^TX^TXD$ is a real symmetric matrix, so we can decompose it as $D^TX^TXD=Q\Lambda Q^T$, where $Q$ is an orthogonal matrix whose columns are eigenvectors of $D^TX^TXD$. I couldn't get much further than this. Any suggestions for this optimization problem?
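
For concreteness, here is a quick numerical sanity check of the claimed solution (a sketch using numpy; the random $X$ and the sizes $m=50$, $n=5$, $l=2$ are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 50, 5, 2
X = rng.standard_normal((m, n))

def objective(D):
    return np.trace(D.T @ X.T @ X @ D)

# Claimed optimum: the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
D_star = eigvecs[:, ::-1][:, :l]             # top-l eigenvectors as columns

# Compare against many random matrices with orthonormal columns (via QR).
best_random = max(
    objective(np.linalg.qr(rng.standard_normal((n, l)))[0])
    for _ in range(2000)
)
print(objective(D_star) >= best_random)      # expected: True
```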


2 Answers


Let us denote $X^\top X$ by $A$. By construction, it is an $n\times n$ square symmetric positive semi-definite matrix, i.e. it has an eigenvalue decomposition $A=V\Lambda V^\top$, where $V$ is the matrix of eigenvectors (each column is an eigenvector) and $\Lambda$ is a diagonal matrix of non-negative eigenvalues $\lambda_i$ sorted in descending order.

You want to maximize $$\operatorname{Tr}(D^\top A D),$$ where $D$ has $l$ orthonormal columns. Let us write it as $$\operatorname{Tr}(D^\top V\Lambda V^\top D)=\operatorname{Tr}(\tilde D^\top\Lambda \tilde D)=\operatorname{Tr}\big(\tilde D^\top \operatorname{diag}\{\lambda_i\}\, \tilde D\big)=\sum_{i=1}^n\lambda_i\sum_{j=1}^l\tilde D_{ij}^2.$$ This algebraic manipulation corresponds to rotating the coordinate frame such that $A$ becomes diagonal. The matrix $D$ gets transformed as $\tilde D=V^\top D$ which also has $l$ orthonormal columns. And the whole trace is reduced to a linear combination of eigenvalues $\lambda_i$.
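
Here is a small numerical check of this identity (a numpy sketch; the random $A$ and $D$ are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 6, 3
X = rng.standard_normal((20, n))
A = X.T @ X                                       # symmetric PSD

lam, V = np.linalg.eigh(A)                        # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                    # re-sort in descending order

D = np.linalg.qr(rng.standard_normal((n, l)))[0]  # random orthonormal columns
D_tilde = V.T @ D
a = (D_tilde ** 2).sum(axis=1)                    # row sums of squares

print(np.isclose(np.trace(D.T @ A @ D), lam @ a))       # trace = sum_i lambda_i a_i
print(np.all(a <= 1 + 1e-12), np.isclose(a.sum(), l))   # a_i <= 1 and they sum to l
```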

What can we say about the coefficients $a_i=\sum_{j=1}^l\tilde D_{ij}^2$ in this linear combination? They are the row sums of squares in $\tilde D$, i.e. the diagonal elements of the orthogonal projector $\tilde D\tilde D^\top$, and hence (i) they are all $\le 1$ and (ii) they sum to $\operatorname{Tr}(\tilde D^\top\tilde D)=l$. If so, then it is rather obvious that to maximize the sum one should take these coefficients to be $(1,\ldots, 1, 0, \ldots, 0)$, simply selecting the top $l$ eigenvalues. Indeed, if e.g. $a_1<1$, then the sum cannot decrease (and strictly increases when the eigenvalues are distinct) if we set $a_1=1$ and reduce the last non-zero $a_i$ accordingly.

This means that the maximum is achieved when $\tilde D$ consists of the first $l$ columns of the identity matrix, and accordingly when $D=V\tilde D$ consists of the first $l$ columns of $V$, i.e. the first $l$ eigenvectors. QED.
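
Numerically, with $D$ equal to the first $l$ eigenvectors, $\tilde D=V^\top D$ is indeed the first $l$ columns of the identity and the trace equals $\lambda_1+\dots+\lambda_l$ (a numpy sketch, sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 6, 3
X = rng.standard_normal((20, n))
A = X.T @ X

lam, V = np.linalg.eigh(A)
lam, V = lam[::-1], V[:, ::-1]                 # descending eigenvalues

D = V[:, :l]                                   # first l eigenvectors
D_tilde = V.T @ D                              # = first l columns of I_n
print(np.allclose(D_tilde, np.eye(n)[:, :l]))
print(np.isclose(np.trace(D.T @ A @ D), lam[:l].sum()))   # = lambda_1 + ... + lambda_l
```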

(Of course this is not a unique solution: $D$ can be rotated/reflected by any $l\times l$ orthogonal matrix without changing the value of the trace.)
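
A quick check of this invariance (a numpy sketch; $Q$ is a random $l\times l$ orthogonal matrix obtained from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 6, 3
X = rng.standard_normal((20, n))
A = X.T @ X
V = np.linalg.eigh(A)[1][:, ::-1]                  # eigenvectors, descending order
D = V[:, :l]

Q = np.linalg.qr(rng.standard_normal((l, l)))[0]   # random l x l orthogonal matrix
print(np.isclose(np.trace(D.T @ A @ D),
                 np.trace((D @ Q).T @ A @ (D @ Q))))   # same trace
```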


This is very close to my answer in Why does PCA maximize total variance of the projection? This reasoning follows @whuber's comment in that thread:

[I]s it not intuitively obvious that given a collection of wallets of various amounts of cash (modeling the non-negative eigenvalues), and a fixed number $k$ that you can pick, that selecting the $k$ richest wallets will maximize your total cash? The proof that this intuition is correct is almost trivial: if you haven't taken the $k$ largest, then you can improve your sum by exchanging the smallest one you took for a larger amount.

amoeba

Define $W=X^TX$, and denote by $v_i$ a unit-norm eigenvector corresponding to its $i$-th largest eigenvalue.

By the variational characterization of eigenvalues, $$ v_1 = \underset{x,\|x\|_2=1}{\arg\max} ~ ~ x^T W x $$

Since you are looking for a matrix with orthonormal columns, your next vector should lie in the subspace orthogonal to $v_1$. Define $W^{(2)}=W-v_1v_1^TW$ (note that $Wv_1=\lambda_1 v_1$, so $W^{(2)}=W-\lambda_1 v_1v_1^T$ is still symmetric). It just so happens that $$ v_2 = \underset{x,\|x\|_2=1}{\arg\max} ~ ~ x^T W^{(2)} x $$ And so on.
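
A small numerical illustration of this deflation step (a numpy sketch; the random $X$ is just for illustration, and eigenvectors are compared up to sign):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
W = X.T @ X

lam, V = np.linalg.eigh(W)
lam, V = lam[::-1], V[:, ::-1]                 # descending eigenvalues
v1, v2 = V[:, 0], V[:, 1]

W2 = W - np.outer(v1, v1) @ W                  # deflate the leading eigenpair
lam2, V2 = np.linalg.eigh(W2)
top = V2[:, np.argmax(lam2)]                   # unit maximizer of x^T W2 x

print(np.isclose(abs(top @ v2), 1.0))          # new maximizer is v2 (up to sign)
print(np.isclose(lam2.max(), lam[1]))          # and it attains lambda_2
```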

Why are we sure that it is indeed the eigenvectors that maximize the sum? Can't we start with a different pair of vectors and then make up for it afterwards, as whuber pointed out?

If $X=U\Sigma V^T$ is the singular value decomposition of $X$, then $X^TX=W=V\Sigma^2 V^T$ is the eigendecomposition of $W$.

Define $X_l=U_l\Sigma_l V_l^T$, where $U_l, V_l$ are $U,V$ truncated to the first $l$ columns and $\Sigma_l$ to the leading $l\times l$ block.

By the Eckart–Young–Mirsky theorem we know that $$ \|X-X_l\|_F^2=\min_{A,\ \operatorname{rank}(A)\leq l} \|X-A\|_F^2. $$ Moreover, whenever $A$ is an orthogonal projection of $X$ onto some subspace, Pythagoras gives $\|X\|_F^2=\|A\|_F^2+\|X-A\|_F^2$, so minimizing $\|X-A\|_F^2$ is the same as maximizing $\|A\|_F^2$. Hence, among projections of $X$ onto subspaces of dimension at most $l$, $$ \|X_l\|_F^2=\max_{A} \|A\|_F^2 $$
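
The Pythagorean relation behind this step is easy to verify numerically (a numpy sketch; the random $X$ and $l=2$ are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 5))
l = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_l = (U[:, :l] * s[:l]) @ Vt[:l, :]           # best rank-l approximation

# Squared Frobenius norms of the projection and the residual add up:
print(np.isclose(np.linalg.norm(X)**2,
                 np.linalg.norm(X_l)**2 + np.linalg.norm(X - X_l)**2))
print(np.isclose(np.linalg.norm(X_l)**2, np.sum(s[:l]**2)))   # = sum of top-l sigma_i^2
```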

Now, observe that

  • $X_l^TX_l=V_l\Sigma_l^2V_l^T$, that is, $V_l^TX_l^TX_lV_l=\Sigma_l^2$
  • $\|X_l\|_F^2=\sum_{i=1}^l\sigma_i^2$

Therefore, $\mbox{tr}(V_l^TX_l^TX_lV_l)=\mbox{tr}(\Sigma_l^2)=\sum_{i=1}^l\sigma_i^2$ is optimal.

Finally, note that $V_l^TX_l^TX_lV_l=V_l^TX^TXV_l$, so $D=V_l$ attains this maximum for the original objective.
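
As a final numerical check that $D=V_l$ attains this value (a numpy sketch, with an arbitrary random $X$ and $l=2$):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((30, 5))
l = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_l = Vt[:l, :].T                              # first l right singular vectors

value = np.trace(V_l.T @ X.T @ X @ V_l)
print(np.isclose(value, np.sum(s[:l]**2)))     # = sigma_1^2 + ... + sigma_l^2
```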

cangrejo
    This answer is unconvincing: how do we know that you can't do better by, say, *not* optimizing the value at the first step, thereby accepting a slightly suboptimal value, in return for doing much better in the second step? – whuber Dec 13 '17 at 16:44
  • @whuber Fair point. I'll try to improve it. – cangrejo Dec 13 '17 at 17:08