13

I am working in computer vision and have to optimize an objective function that involves a matrix $X$, where $X$ is constrained to be orthogonal.

$$\text{maximize} \ \ f(X)$$

$$\text{s.t.} \ \ X^T X=I$$

where $I$ is the identity matrix. I am reading some papers and they use complicated terminology like optimization over the Grassmann manifold or the Stiefel manifold. Basically, the gradient on those manifolds is different from the regular gradient over Euclidean space.

Can you suggest an accessible paper to read so that I can implement gradient-based methods on those manifolds? Any simple Matlab examples that illustrate the approach would also be helpful.

Thanks

  • 1
    Can you please provide more information about your problem at hand? As it stands it is very broad to answer. For example you might as well generate a random square matrix $Y$, get its SVD such that $Y= USV^T$ and use $U$ as a candidate solution for $X$. – usεr11852 Dec 20 '16 at 22:54
  • 1
    What is $f(.)$? – user603 Dec 20 '16 at 23:00
  • 5
    Do you really need the general treatment of Stiefel manifolds or would it suffice to have information about the (much simpler) constraint $X^\prime X=I$? In the latter case the problem reduces to moving along circles. The simplest example of this is in two dimensions, where the constraint defines the unit circle: the gradient at a point is directed along the tangent line at that point, but to follow the gradient you have to move along the circle rather than the tangent line. This idea generalizes. (cc @usεr11852) – whuber Dec 20 '16 at 23:02
  • @whuber: +1 and thank you for cc'ing me. I obviously agree. With my example I just wanted to show that even a simplistic random search would suffice to give valid candidate solutions. – usεr11852 Dec 20 '16 at 23:06
  • A generic solution is to add the set of quadratic equality constraints to $f$ using [Lagrange multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier). (As an alternative to directly parameterizing the manifold.) – GeoMatt22 Dec 20 '16 at 23:14
  • @GeoMatt That's a good point. I suspect, though, that the direct parameterization might be much more efficient in higher dimensions. That's because any nonzero gradient of $f$ at a point $p$ (representing an orthonormal frame $X$) will determine two orthogonal linearly independent vectors $u$ and $v$ for which the geodesic along the gradient will consist of rotations of that frame through the direction from $u$ to $v$. That is so simple that it ought to permit direct application of any gradient-tracking methods for *unconstrained* problems. – whuber Dec 20 '16 at 23:24
  • @whuber I am largely ignorant in the area of differential geometry, so you are far over my head here! Would "direct parameterization" be something like a [product of primitives](https://en.wikipedia.org/wiki/Orthogonal_matrix#Primitives)? Or are you suggesting no explicit parameterization is needed, but rather to somehow locally project the $f$ gradient onto the "manifold patch" given by the local $\partial{(X^TX)}=0$ constraint? (Perhaps something like [this](http://scicomp.stackexchange.com/q/14259/17742)?) – GeoMatt22 Dec 21 '16 at 04:02
  • 4
    @GeoMatt This might be more elementary than you think. The distinction is this: in the plane, $f$ is a function defined on the unit circle $x^2+y^2=1$. You could optimize it by extending it to a function of all $(x,y)$ in a neighborhood of the unit circle and using Lagrange multipliers to enforce the constraint $x^2+y^2=1$. Or, you could parameterize the unit circle as $(x,y)=(\cos(t),\sin(t))$, thereby making $f$ a periodic function of the variable $t$, and optimize it without constraints. Exactly the same method works with orthogonal matrices (and Grassmann manifolds). (See the sketch after these comments.) – whuber Dec 21 '16 at 14:59
  • @whuber The idea of taking derivatives w.r.t. a direct parameterization seems clear. I was not certain of what the analogue for polar coordinates would be. Now I can see how this might work, for example using a [Cayley transform](https://en.wikipedia.org/wiki/Cayley_transform#Matrix_map). – GeoMatt22 Dec 22 '16 at 00:50
  • @whuber I tried to put an answer, based on a nice SIAM paper I found. – GeoMatt22 Dec 22 '16 at 22:02
  • The objective function involves the trace, and since it relates to my research work I cannot go into more detail. – Lan Trần Thị Dec 23 '16 at 17:23
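
To make the circle picture from these comments concrete, here is a minimal sketch (Python/NumPy) with an assumed toy objective $f(x,y)=2x+y$, which is not from the discussion above. Parameterizing the circle as $(\cos t,\sin t)$ turns the constrained problem into plain unconstrained gradient ascent on $t$:

```python
import numpy as np

# Toy objective on the unit circle (assumed for illustration): f(x, y) = 2x + y.
# With the parameterization (x, y) = (cos t, sin t), f becomes a periodic
# function of t, so ordinary unconstrained gradient ascent applies.
def dfdt(t):
    # d/dt [2*cos(t) + sin(t)]
    return -2.0 * np.sin(t) + np.cos(t)

t = 0.0
for _ in range(200):
    t += 0.1 * dfdt(t)           # plain gradient ascent step on the parameter t

x, y = np.cos(t), np.sin(t)
print(x, y)                      # approaches (2, 1)/sqrt(5), the constrained maximizer
```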

1 Answer

10

I found the following paper to be useful:

Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303–353.

This paper has far more information than you may need, in terms of differential geometry and higher-order optimization methods.

However, the information needed to answer your question is actually quite straightforward, and is in fact contained mostly in equations 2.53 and 2.70, which are both of the form $$\nabla{f}=G-X\Phi$$ where the nominal (Euclidean) gradient $$G=\frac{\partial{f}}{\partial{X}}$$ is corrected to the constrained gradient $\nabla{f}$ by subtracting off its component $X\Phi$ along the current solution $X$. This removed component is normal to the manifold (just as in circular motion, where the normal direction points toward the center), and the correction ensures the gradient is tangent to the manifold.

Note: These formulas assume you are already on the manifold, i.e. $X^TX=I$. So in practice you will need to make sure your initial condition is appropriate (e.g. $X_0=I$, possibly rectangular). You may also need to occasionally correct accumulated roundoff and/or truncation error (e.g. via SVD; see the "ZCA" example below).
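
As a rough sketch of that occasional correction (Python/NumPy; the helper name is just illustrative), one can project a drifted $X$ back onto the manifold via the polar factor of its SVD:

```python
import numpy as np

def reorthogonalize(X):
    """Project X back onto the set of matrices with orthonormal columns.

    Uses the SVD X = U S V^T and returns the polar factor U V^T, the closest
    matrix with orthonormal columns in the Frobenius norm.
    """
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt
```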

In the unconstrained case $\Phi=0$, while in the constrained case $\Phi$ takes one of two forms. The first is $$\Phi_{\mathrm{\small{G}}}=X^TG\implies\nabla_{\mathrm{\small{G}}}{f}=\left(I-XX^T\right)G$$ which corresponds to the "Grassmann manifold". The distinction here is that $\nabla_{\mathrm{\small{G}}}$ is insensitive to rotations: if $X=X_0Q$ for a rotation $Q$ (so $Q^TQ=I$), then $XX^T=X_0X_0^T$.

The second form is $$\Phi_{\mathrm{\small{S}}}=G^TX\implies\nabla_{\mathrm{\small{S}}}{f}=G-XG^TX$$ which corresponds to the "Stiefel manifold", and is sensitive to rotations.
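
As a rough NumPy sketch (the helper names are mine, not from the paper), both corrected gradients can be computed directly from $X$ and the nominal gradient $G$:

```python
import numpy as np

def grassmann_grad(X, G):
    # (I - X X^T) G, written to avoid forming the n-by-n projector explicitly
    return G - X @ (X.T @ G)

def stiefel_grad(X, G):
    # G - X G^T X
    return G - X @ (G.T @ X)
```

As a sanity check, on the manifold ($X^TX=I$) the Stiefel form satisfies $X^T\nabla_{\mathrm{\small{S}}}{f}=X^TG-G^TX$, which is skew-symmetric, so the corrected gradient indeed lies in the tangent space.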

A simple example is approximation of a given matrix $A\in\mathbb{R}^{n\times{p}}$ with $p\leq{n}$ by an orthogonal matrix $X$, minimizing the least-squares error. In this case we have $$f[X]=\tfrac{1}{2}\|X-A\|_F^2=\tfrac{1}{2}\sum_{ij}\left(X_{ij}-A_{ij}\right)^2\implies{G}=X-A$$ The unconstrained case $\nabla{f}=G$ has solution $X=A$, because we are not concerned with ensuring $X$ is orthogonal.

For the Grassmann case we have $$\nabla_{\mathrm{\small{G}}}{f}=\left(XX^T-I\right)A=0$$ This can only have a solution if $A$ is square rather than "skinny", because if $p<n$ then the projector $XX^T$ will have a non-trivial null space.

For the Stiefel case, we have $$\nabla_{\mathrm{\small{S}}}{f}=XA^TX-A=0$$ which can be solved even when $p<n$.
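
Putting the pieces together for this least-squares example, here is a minimal projected-gradient sketch on the Stiefel manifold (the step size and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def procrustes_stiefel(A, steps=500, lr=0.1):
    """Minimize 0.5*||X - A||_F^2 over X with orthonormal columns (sketch)."""
    n, p = A.shape
    X = np.eye(n, p)                      # start on the manifold: X^T X = I
    for _ in range(steps):
        G = X - A                         # nominal (Euclidean) gradient
        grad = G - X @ (G.T @ X)          # Stiefel-corrected gradient
        X = X - lr * grad                 # descent step (drifts slightly off the manifold)
        U, _, Vt = np.linalg.svd(X, full_matrices=False)
        X = U @ Vt                        # retract back onto X^T X = I
    return X
```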

These two cases, Grassmann vs. Stiefel, essentially correspond to the difference between "PCA vs. ZCA whitening". In terms of the SVD, if the input matrix is $A=USV^T$, then the solutions are $X_{\mathrm{\small{G}}}=U$ and $X_{\mathrm{\small{S}}}=UV^T$. The PCA solution $X_{\mathrm{\small{G}}}$ only applies to a square input, i.e. $A$ must be the "covariance matrix". However the ZCA solution $X_{\mathrm{\small{S}}}$ can be used when $A$ is a "data matrix". (This is more properly known as the Orthogonal Procrustes problem.)
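
For a quick check (with an assumed random test matrix), the closed-form ZCA/Procrustes solution has orthonormal columns and should agree with the iterative sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                   # a "skinny" data matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)
X_zca = U @ Vt                                    # Stiefel / ZCA (Procrustes) solution

print(np.allclose(X_zca.T @ X_zca, np.eye(3)))    # True: orthonormal columns
print(0.5 * np.linalg.norm(X_zca - A)**2)         # least-squares error at the optimum
```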

GeoMatt22