I am trying to solve $$ \min_W \|Y-XW \|_F^2 \quad \text{s.t.}\quad W_{ij}\geq 0 \ \ \forall (i,j)\in S, $$ where $X$ is the input data, $Y$ is the output data we are fitting, and $S$ is the set of indices of $W$ on which I want to impose a nonnegativity constraint. This is a convex optimization problem that can be solved with quadratic programming.
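For reference, here is a minimal sketch of the quadratic-programming baseline I am comparing against, assuming CVXPY is available; the data sizes and the index set `S` below are illustrative placeholders, not my actual data:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, m = 50, 10, 5                 # samples, input dim, output dim (illustrative sizes)
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, m))
S = [(0, 0), (1, 2), (3, 4)]        # example set of constrained indices

W = cp.Variable((d, m))
objective = cp.Minimize(cp.sum_squares(X @ W - Y))   # squared Frobenius norm
constraints = [W[i, j] >= 0 for (i, j) in S]         # nonnegativity only on S
prob = cp.Problem(objective, constraints)
prob.solve()
W_qp = W.value                       # reference solution
```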
As an exercise, I tried two different methods based on gradient descent:
1. Perform gradient descent on $$ \|Y-XW \|_F^2 + \lambda \sum_{(i,j)\in S}\max(-W_{ij},0). $$ The penalty term (weighted by $\lambda$) is positive whenever any constrained entry $W_{ij}$ with $(i,j)\in S$ is negative (see the first sketch after this list).
2. Perform gradient descent on $$ \|Y-XW \|_F^2, $$ but at the end of each iteration project the constrained entries onto the nonnegative orthant, i.e. set $W_{ij}\leftarrow \max(W_{ij},0)$ for $(i,j)\in S$ (see the second sketch below).
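For concreteness, here is a rough NumPy sketch of method 1; the function name `penalty_gd`, the penalty weight `lam`, the step size `lr`, and the iteration count are illustrative choices of mine, and since $\max(-W_{ij},0)$ is not differentiable at zero the penalty is handled with a subgradient:

```python
import numpy as np

def penalty_gd(X, Y, S, lam=10.0, lr=1e-3, n_iter=5000, W0=None):
    d, m = X.shape[1], Y.shape[1]
    W = np.zeros((d, m)) if W0 is None else W0.copy()
    mask = np.zeros((d, m), dtype=bool)
    for (i, j) in S:
        mask[i, j] = True
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ W)       # gradient of ||Y - XW||_F^2
        sub = np.zeros_like(W)
        sub[mask & (W < 0)] = -lam            # subgradient of lam * max(-W_ij, 0)
        W -= lr * (grad + sub)
    return W
```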
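And a corresponding sketch of method 2 (projected gradient descent), again with illustrative `lr` and `n_iter` values:

```python
import numpy as np

def projected_gd(X, Y, S, lr=1e-3, n_iter=5000, W0=None):
    d, m = X.shape[1], Y.shape[1]
    W = np.zeros((d, m)) if W0 is None else W0.copy()
    rows, cols = zip(*S)                      # index arrays for the constrained entries
    for _ in range(n_iter):
        W -= lr * (-2.0 * X.T @ (Y - X @ W))  # gradient step on ||Y - XW||_F^2
        W[rows, cols] = np.maximum(W[rows, cols], 0.0)   # project entries in S
    return W
```

Both functions take the same `X`, `Y`, and `S` as the CVXPY sketch above, so their outputs can be compared directly against `W_qp`.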
Interestingly, the $W$ found by these gradient descent methods did not converge to the quadratic programming solution. Moreover, both gradient descent methods were sensitive to the initial condition and converged to different $W$ with different cost values.
Why can't these gradient descent methods find the global optimum of the convex problem?