
I am trying to solve $$ \min_W \|Y-XW \|_F^2 \quad \text{s.t.} \quad W_{ij}\geq 0 \ \ \forall (i,j)\in S, $$ where $X$ is the input data, $Y$ is the output data we are trying to fit, and $S$ is the set of index pairs of $W$ on which I want to impose a nonnegativity constraint. This is a convex optimization problem that can be solved with quadratic programming.
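For reference, here is a minimal sketch of the quadratic-programming baseline using `cvxpy`; the data `X`, `Y`, the index set `S`, and their shapes below are placeholders rather than the actual problem:

```python
import cvxpy as cp
import numpy as np

# Toy stand-ins; in practice X, Y, and S come from the actual problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # input data
Y = rng.standard_normal((100, 3))   # output data to fit
S = [(0, 0), (1, 2), (4, 1)]        # index pairs of W constrained to be nonnegative

W = cp.Variable((X.shape[1], Y.shape[1]))
objective = cp.Minimize(cp.sum_squares(Y - X @ W))    # squared Frobenius norm
constraints = [W[i, j] >= 0 for (i, j) in S]          # nonnegativity only on S
problem = cp.Problem(objective, constraints)
problem.solve()
W_qp = W.value                                        # reference solution
```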

As an exercise, I tried two different methods that use gradient descent.

  1. Perform gradient descent on $$ \|Y-XW \|_F^2 + \lambda \sum_{(i,j)\in S}\max(-W_{ij},0), $$ where $S$ is the set of index pairs of $W$ on which I impose the nonnegativity constraint. The penalty term (weighted by $\lambda$) is positive whenever any constrained entry $W_{ij}$, $(i,j)\in S$, is negative.

  2. Perform gradient descent on $$ \|Y-XW \|_F^2, $$ but at the end of each iteration project the constrained entries onto the nonnegative orthant, i.e. set $W_{ij}\leftarrow \max(W_{ij},0)$ for $(i,j)\in S$. (A sketch of both variants follows this list.)
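
A minimal NumPy sketch of both gradient-descent variants, assuming illustrative values for the step size, penalty weight, and iteration count; the helper names `gd_penalty` and `gd_projected` are mine, not from the question:

```python
import numpy as np

def grad_quadratic(W, X, Y):
    """Gradient of ||Y - XW||_F^2 with respect to W."""
    return 2.0 * X.T @ (X @ W - Y)

def gd_penalty(X, Y, S, lam=10.0, lr=2e-4, n_iter=50_000):
    """Method 1: subgradient descent on the penalized objective.
    max(-w, 0) is non-differentiable at 0; -1[w < 0] is a valid subgradient."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    rows, cols = zip(*S)
    for _ in range(n_iter):
        g = grad_quadratic(W, X, Y)
        sub = np.zeros_like(W)
        sub[rows, cols] = -lam * (W[rows, cols] < 0)    # subgradient of the penalty
        W -= lr * (g + sub)
    return W

def gd_projected(X, Y, S, lr=2e-4, n_iter=50_000):
    """Method 2: projected gradient descent (clip the constrained entries)."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    rows, cols = zip(*S)
    for _ in range(n_iter):
        W -= lr * grad_quadratic(W, X, Y)
        W[rows, cols] = np.maximum(W[rows, cols], 0.0)  # project onto the feasible set
    return W
```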

Interestingly, the $W$s found with these gradient descent methods did not match the solution obtained by quadratic programming. The gradient descent methods were also sensitive to the initial condition and converged to different $W$s with different cost values.

Why can't these gradient descent methods find the global optimum of the convex problem?

CWC
  • What was the norm of your gradient at your final iterate? It should be very small as an indication of true convergence. Additionally, at the final iterate, it is helpful to check if the Hessian is positive-definite to check if you are truly at a minimum or just a saddle point. Note that the $\max(x,0)$ function is flat for $x < 0$, so it is possible that your loss function contains a large flat valley where the gradient is too small to induce a big enough step, and so you would need to increase your step size. – mhdadk Sep 09 '21 at 16:09
  • @mhdadk Thanks. I checked the Hessian of the gradients, and they are positive-definite. The L2 norms of the gradients at the final iterations were 1e-3, and I used a learning rate of 2e-4. So I think I can say I reached minima. Regarding your point on the flatness: although $\max(x,0)$ has a flat region, my loss function also contains a quadratic error term, which puts non-zero gradients in that region. So I shouldn't need to worry about the flatness of $\max(x,0)$, right? – CWC Sep 09 '21 at 19:12
  • I'm not sure what you mean by "Hessian of the gradient". Could you clarify? I was referring to the Hessian of the cost function evaluated at your final iterate. As for the flatness, it would be a concern if your Lagrange multiplier were large, thereby weighting the regularization term more heavily. Otherwise, it isn't really a concern. One thing you could try is to generate surface/contour plots for a much simpler version of your cost function. That is, decrease the dimensionality of $Y,X,$ and $W$ to 1 or 2 and create these plots to check if your cost function is indeed convex. – mhdadk Sep 09 '21 at 19:41
  • @mhdadk Sorry, "Hessian of the gradient" was a typo. I meant the Hessian of the cost function at the final iterate. – CWC Sep 09 '21 at 20:42
  • @mhdadk I see. Thanks for the insight! – CWC Sep 09 '21 at 20:46
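
A rough sketch of the convergence checks mhdadk suggests in the comments above (gradient norm at the final iterate, and definiteness of the Hessian of the quadratic term), written for the penalized objective of method 1; `lam` and the helper name are illustrative assumptions, not values from the question:

```python
import numpy as np

def convergence_checks(W, X, Y, S, lam=10.0):
    """Diagnostics for f(W) = ||Y - XW||_F^2 + lam * sum_{(i,j) in S} max(-W_ij, 0).

    The penalty is piecewise linear, so away from the kinks at W_ij = 0 the
    Hessian of f equals the Hessian of the quadratic term alone."""
    grad = 2.0 * X.T @ (X @ W - Y)
    rows, cols = zip(*S)
    grad[rows, cols] += -lam * (W[rows, cols] < 0)      # penalty subgradient
    grad_norm = np.linalg.norm(grad)

    # Hessian of vec(W) -> ||Y - XW||_F^2 is 2 * (I kron X^T X): positive
    # semidefinite always, positive definite iff X has full column rank.
    H = 2.0 * np.kron(np.eye(Y.shape[1]), X.T @ X)
    min_eig = np.linalg.eigvalsh(H).min()
    return grad_norm, min_eig
```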

0 Answers