
I'm trying to understand how regularization works in terms of projections onto an $\ell_*$ ball, and Euclidean projection onto the simplex.

I'm not sure I understand what we mean when we project the weight vector onto the $\ell_1$ or the $\ell_2$ ball.

I can understand the concept of $\ell_1$ regularization programmatically: we go through each element of the weight vector and apply `signum(w) * max(0.0, abs(w) - shrinkageValue)`, where `shrinkageValue = regularizationParameter * eta`, thereby driving small weights to 0.
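In code, what I mean is something like this NumPy sketch (the names just mirror the description above):

```python
import numpy as np

def soft_threshold(w, regularization_parameter, eta):
    """Elementwise shrinkage: signum(w) * max(0, |w| - shrinkage_value)."""
    shrinkage_value = regularization_parameter * eta
    return np.sign(w) * np.maximum(0.0, np.abs(w) - shrinkage_value)
```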

I guess I'm missing some math here, so my question is how do we translate the projection of the vector into the program I just described? How are regularization and vector projections connected?

Edit: I'm trying to work through the paper *Efficient Projections onto the $\ell_1$-Ball for Learning in High Dimensions*.

Bar

1 Answer


Regularization and vector projections are connected through the idea of constrained optimization and the Karush-Kuhn (no relation)-Tucker (KKT) conditions.

What are the KKT conditions?

Briefly, these state that if $x$ is a solution to the problem "minimize $f(x)$ subject to $g(x) \le 0$", then there is a scalar $\lambda \ge 0$ such that $\nabla f(x) + \lambda \nabla g(x) = 0$. But that is precisely the stationarity condition for the unconstrained problem "minimize $f(x) + \lambda g(x)$", so the constrained solution is also a stationary point of a penalized, unconstrained objective.

The intuition is that either:

  • $g(x) < 0$. In this case, $x$ is an "interior" solution, so the gradient of $f$ must be zero at that point. (If it weren't zero, we could move a little bit in the direction $-\nabla f(x)$ from $x$, while maintaining $g(x) < 0$, and get a lower value for $f(x)$.) Then we set $\lambda = 0$ and we're done.

  • Or, $g(x) = 0$. In this case, $x$ is on the edge of the feasible set. Locally, this edge looks like a hyperplane orthogonal to the gradient $\nabla g(x)$, because the way you maintain the $g(x) = 0$ constraint is to not move up or down the gradient at all. But that means $\nabla f$ can only point along the same line as $\nabla g$: if it had any component orthogonal to $\nabla g$, we could move $x$ a little bit against that component, stay on the hyperplane $g(x) = 0$, and decrease $f(x)$.
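For a concrete one-dimensional illustration of the boundary case, take "minimize $f(x) = (x-2)^2$ subject to $g(x) = x - 1 \le 0$". The unconstrained minimum $x = 2$ is infeasible, so the solution sits on the boundary at $x = 1$, where $f'(1) = -2$ and $g'(1) = 1$. The condition $\nabla f + \lambda \nabla g = 0$ then holds with $\lambda = 2 \ge 0$, and $x = 1$ is indeed the minimizer of the penalized objective $(x-2)^2 + 2(x-1)$.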

How the KKT conditions explain the relationship between constrained minimization and regularization

If $g(x) = \|x\| - c$ for some norm and some constant $c$, then the constraint $g(x) \le 0$ means that $x$ lies inside the ball of radius $c$ under that norm. And in the unconstrained formulation, adding $\lambda g(x)$ to the function you want to minimize is what ends up applying the regularization penalty: you're really adding $\lambda \|x\| - \lambda c$ (and the constant $\lambda c$ doesn't affect the minimizer).

People often take advantage of this "duality" between unconstrained and constrained optimization. For an example that I could find quickly by Googling, see On the LASSO and its dual.

Why are projections important here?

OK, so why would someone write a paper on fast projections?

Basically, one way to do general constrained optimization, "minimize $f(x)$ subject to $x \in X$", is the following:

  • Take any iterative algorithm for unconstrained minimization of $f(x)$
  • Start with a guess $x_0$
  • Take one step of the algorithm: $x_0^\prime \leftarrow step(x_0)$
  • Then project back onto the set $X$: $x_1 \leftarrow P_X(x_0^\prime)$.
  • And repeat until convergence.

For instance, this is how projected gradient descent is derived from ordinary gradient descent. Of course, optimizing your projection function $P_X$ is vitally important here.
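As a sketch, here is that loop in Python; `grad_f` and `project_onto_X` are placeholders for whatever objective gradient and projection you have:

```python
def projected_gradient_descent(grad_f, project_onto_X, x0, eta=0.1, n_steps=1000):
    """Minimize f over a set X: take an unconstrained gradient step, then project back onto X."""
    x = project_onto_X(x0)
    for _ in range(n_steps):
        x_unconstrained = x - eta * grad_f(x)  # one step of the unconstrained algorithm
        x = project_onto_X(x_unconstrained)    # pull the iterate back into the feasible set
    return x
```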

Putting it all together

So, suppose that you want to solve the LASSO: $$\arg\min_\beta \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1$$

That's the unconstrained version. By the KKT conditions, adding the regularization term is equivalent to constraining the solution to satisfy $\|\beta\|_1 \le c$ for some constant $c$. But that's just the $\ell_1$-ball with radius $c$!

So you could imagine solving this with projected (sub)gradient descent.* If you did, your $P_X$ function would be the projection onto that $\ell_1$-ball, and you would want to make it fast.

*I don't think people actually do this, because there are more efficient ways. But those might use projections also. EDIT: as @Dougal points out, a more sophisticated variant of projected subgradient descent was good enough to write a paper about in 2008.
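For reference, here is a sketch of the sort-based projection in the spirit of the Duchi et al. paper (project the magnitudes $|v|$ onto the simplex, then restore signs); consider it an illustration rather than the paper's fastest variant:

```python
import numpy as np

def project_onto_l1_ball(v, c=1.0):
    """Euclidean projection of v onto the l1 ball {w : ||w||_1 <= c}."""
    if np.abs(v).sum() <= c:
        return v  # already feasible
    # Project the magnitudes onto the simplex {w : w >= 0, sum(w) = c}.
    u = np.sort(np.abs(v))[::-1]          # magnitudes, sorted descending
    cumulative = np.cumsum(u) - c
    indices = np.arange(1, len(u) + 1)
    rho = indices[u - cumulative / indices > 0][-1]
    theta = cumulative[rho - 1] / rho     # data-dependent shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

Note that the last line is exactly the `signum(w) * max(0.0, abs(w) - shrinkageValue)` update from the question, except that the threshold $\theta$ is computed from the vector rather than fixed in advance. Plugged in as the $P_X$ above (e.g. `projected_gradient_descent(grad_f, lambda w: project_onto_l1_ball(w, c), beta0)`), it gives the projected-subgradient approach sketched here.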

Ben Kuhn
    The ISTA/[FISTA](http://mechroom.technion.ac.il/~becka/papers/71654.pdf) algorithm is basically (accelerated) projected subgradient descent, which is maybe not the most-favored LASSO algorithm but it's pretty good (and I think was state of the art around 2008 when that paper was published). – Danica May 01 '15 at 18:34
  • @Dougal: thanks for the reference! I've edited it in. – Ben Kuhn May 01 '15 at 19:38