11

It is well-known (e.g. in the field of compressive sensing) that the $L_1$ norm is "sparsity-inducing," in the sense that if we minimize the functional (for fixed matrix $A$ and vector $\vec{b}$) $$f_{A,\vec{b}}(\vec{x})=\|A\vec{x}-\vec{b}\|_2^2+\lambda\|\vec{x}\|_1$$ for large enough $\lambda>0$, then for many choices of $A$, $\vec{b}$, and $\lambda$ the minimizing $\vec{x}$ is likely to have many exactly-zero entries.

But if we minimize $f_{A,\vec{b}}$ subject to the condition that the entries of $\vec{x}$ are positive and sum to $1$, then the $L_1$ term doesn't have any effect (because $\|\vec{x}\|_1=1$ by fiat). Is there an analogous $L_1$-type regularizer that works in this case to encourage that the resulting $\vec{x}$ is sparse?
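(For concreteness, here is a minimal sketch in NumPy/SciPy, with placeholder random $A$ and $\vec{b}$, of the constrained problem; since $\|\vec{x}\|_1=1$ everywhere on the simplex, the $\lambda$ term only shifts the objective by a constant and the minimizer does not depend on $\lambda$.)

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((20, n))   # placeholder data
b = rng.standard_normal(20)
lam = 0.5                          # has no effect on the simplex

def objective(x):
    # On the simplex, np.abs(x).sum() == 1, so the L1 term is a constant offset.
    return np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x0 = np.full(n, 1.0 / n)           # start at the barycenter
res = minimize(objective, x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
print(np.round(res.x, 4))          # same minimizer for every lam
```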

cardinal
  • 24,973
  • 8
  • 94
  • 128
Justin Solomon
  • 749
  • 3
  • 12
  • Could you elaborate on "then the $L_1$ term doesn't have any effect (because $||x||_1 = 1$ by fiat)"? – Cam.Davidson.Pilon Aug 24 '12 at 12:27
  • 2
    @Cam.Davidson.Pilon: $x_i \geq 0$ and $\sum_i x_i = 1$ implies $\|x\|_1 = 1$. :) – cardinal Aug 24 '12 at 13:41
  • 1
    Justin: Some more details might give a better chance at a useful answer. Here are some questions that immediately arise upon reading your description: (**1**) Where is the "stochastic matrix" in all of this? You only seem to describe a situation involving a stochastic *vector*. These could just be individual rows of your stochastic matrix, or other structure might become evident once more details are present. (**2**) You want the *probabilities* themselves to be sparse, or perhaps, sparse in some appropriate basis? If the first, why? (Is this some random walk on a weighted (sparse) graph?) – cardinal Aug 24 '12 at 13:47
  • Why are you requiring that entries of $\vec x$ are *positive*? Should you instead be requiring that they be *nonnegative*? Also, have you considered re-parameterizing to eliminate the constraint (assuming you mean non-negative)? In other words, try $x_i = \frac{\exp(w_i)}{\sum_j \exp(w_j)}$ – jrennie Aug 24 '12 at 17:03
  • 1
    @jrennie: Given the context, by *positive* Justin surely meant *nonnegative*. – cardinal Aug 24 '12 at 17:44
  • Oops, for some reason these responses didn't pop up! – Justin Solomon Aug 27 '12 at 21:37
  • Some details: (1) "Stochastic" for me means "adds up to one." I really want $\vec{x}$ and $\vec{b}$ to be matrices, but these things nicely separate column-by-column. Apologies for not making this clearer! (2) Yes, I'd like the probabilities to be sparse in the identity basis. It's for a particular application in computational geometry for which we want to solve "linear regression where the answer should be very sparse in the identity and the resulting output should sum to one." – Justin Solomon Aug 27 '12 at 21:40
  • (And yes, we definitely want non-negative -- in fact, since we're seeking sparsity we prefer as many zeros as possible!) – Justin Solomon Aug 27 '12 at 21:40
  • Hey there! Any chance you could give a reference for where things like "sparsity" are defined? I'm trying to read a machine learning textbook for a grad. school course (mathematician here), and I was having a lot of trouble just finding a clear definition of "sparsity" and "sparsity-inducing," in spite of everyone using the word. (This is possibly the most frustrating scenario when trying to read haha.) – Tanner Strunk Aug 25 '17 at 01:28
  • Hi Tanner! Try taking a look at this Wiki page and reading about the L0 norm (not really a norm) and its relaxation to L1: https://en.wikipedia.org/wiki/Structured_sparsity_regularization My textbook also contains an introduction to L1 regularization :-) Search for information about sparsity in "compressive sensing" as well, for more search terms! Email me if you're still stuck and I'll try to find some more pointers for you. – Justin Solomon Aug 25 '17 at 02:34

4 Answers

2

A general method for creating sparse solutions is MAP estimation with a zero-mean normal prior that has an unknown variance:

$$x_i\mid\sigma_i^2\sim N(0,\sigma_i^2)$$

If you then assign a prior to $\sigma_i^2$ that has a mode at zero, the posterior mode is usually sparse. The $L_1$ penalty arises from this approach by taking an exponential mixing distribution:

$$\sigma_i^2\mid\lambda\sim\mathrm{Exp}\left(\frac{\lambda^2}{2}\right)$$

Marginalizing out $\sigma_i^2$ then gives

$$\log[p(x_i\mid\lambda)]=-\lambda |x_i|+\log\left[\frac{\lambda}{2}\right]$$
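Explicitly, this is the standard scale-mixture-of-normals identity behind the Laplace prior,

$$p(x_i\mid\lambda)=\int_0^\infty N(x_i\mid 0,\sigma_i^2)\,\frac{\lambda^2}{2}e^{-\lambda^2\sigma_i^2/2}\,d\sigma_i^2=\frac{\lambda}{2}e^{-\lambda|x_i|},$$

whose negative logarithm is the $L_1$ penalty up to a constant.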

Some alternatives are the generalised double Pareto, the half-Cauchy, and the inverted beta. In some sense these are better than the lasso because they do not shrink large values. In fact, I'm pretty sure the generalised double Pareto can be written as a mixture of exponentials: write $\lambda=\lambda_i$ and then place a gamma prior $p(\lambda_i\mid\alpha,\beta)$ on each. We get:

$$p(x_i\mid\alpha,\beta)=\frac{\alpha}{2\beta}\left(1+\frac{|x_i|}{\beta}\right)^{-(\alpha+1)}$$
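For reference, the exponential-gamma mixture (with shape $\alpha$ and rate $\beta$ on $\lambda_i$) does integrate to exactly that density:

$$p(x_i\mid\alpha,\beta)=\int_0^\infty \frac{\lambda_i}{2}e^{-\lambda_i|x_i|}\,\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda_i^{\alpha-1}e^{-\beta\lambda_i}\,d\lambda_i=\frac{\alpha\beta^\alpha}{2\,(\beta+|x_i|)^{\alpha+1}}.$$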

Note that I have included normalising constants, as they help choose good global parameters. Now if we apply the range restriction then we have a more complicated problem, as we need to renormalise over the simplex.

Another generic feature of sparsity-inducing penalties is that they are not differentiable at zero, usually because the left and right derivatives there have opposite signs.

This is based on the brilliant work by Nicolas Polson and James Scott on variance-mean mixture representations, which they use to develop TIRLS, a massive extension of least squares to a very large class of loss-penalty combinations.

As an alternative, you could use a prior that is defined on the simplex but has marginal distributions with modes at zero. One example is the Dirichlet distribution with all parameters between 0 and 1. The implied penalty (the negative log density) would look like:

$$-\sum_{i=1}^{n-1}(a_i-1)\log(x_i) - (a_n-1)\log(1-\sum_{i=1}^{n-1}x_i)$$

where $0<a_i<1$. However, you would need to be careful when optimising numerically, as the penalty has singularities. A more robust estimation process is to use the posterior mean; although you lose exact sparseness, you will get many posterior means that are close to zero.
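As a rough illustration of what optimising this penalty might look like (a sketch only, assuming SciPy's SLSQP, placeholder data, Dirichlet parameters $a_i=0.5$, and a small clipping constant to keep the log away from its singularity at zero):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((20, n))      # placeholder data
b = rng.standard_normal(20)
a = np.full(n, 0.5)                   # Dirichlet parameters, 0 < a_i < 1
eps = 1e-8                            # keeps log() finite near the boundary

def objective(x):
    fit = np.sum((A @ x - b) ** 2)
    # Penalty implied by the Dirichlet prior; with a_i < 1 it rewards zeros.
    penalty = -np.sum((a - 1.0) * np.log(np.clip(x, eps, None)))
    return fit + penalty

x0 = np.full(n, 1.0 / n)
res = minimize(objective, x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
print(np.round(res.x, 4))             # entries are pushed toward zero
```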

probabilityislogic
  • 22,555
  • 4
  • 76
  • 97
  • This seems like a very interesting idea, although we're not quite equipped to understand the details! If I understand correctly, the idea is that the $L_1$ prior comes from an assumption that the variables follow an exponential distribution about 0. So, we need a distribution centered at 0 that works better for our variables. But, there's no clear winner, right? Are there distributions over "positive variables that sum to 1" ? Thanks for your help! – Justin Solomon Aug 27 '12 at 22:10
  • To get sparsity you need a distribution with a mode at zero. And the dirichlet distribution is over the simplex, which is precisely those distributions which sum to 1. Another general class is logistic-normal or logistic t where you have a normal/t distribution for $\log\left[\frac{x_i}{x_n}\right]$ – probabilityislogic Aug 28 '12 at 00:03
  • Ah, the Dirichlet seems quite interesting in that it is on the simplex we are interested in, as you mention! It seems that the other two you mention might introduce some asymmetry on $x_n$, right? My collaborator and I will work through the energy function implied by Dirichlet tomorrow and will report back! Many thanks for your patient help thus far -- this is far from our usual field but if we can work it out the results may provide a considerable step forward in geometry processing! [And of course we'll provide you with due credit!] – Justin Solomon Aug 28 '12 at 06:50
1

Two options:

  1. Use an $L_0$ penalty on $\vec x$. The obvious drawback is that this is nonconvex and hence difficult to optimize.
  2. Reparameterize, $x_i = \frac{\exp(w_i)}{\sum_j \exp(w_j)}$ and use a penalty on the new (natural) parameter vector, $\|\vec w\|$. This will encourage events to be equally probable unless there is a good reason for them not to be.
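A rough sketch of option 2 (assuming NumPy/SciPy, a squared penalty on $\vec w$, and placeholder data); as the comments below note, this pulls the $x_i$ toward $1/n$ rather than toward zero:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 10
A = rng.standard_normal((20, n))       # placeholder data
b = rng.standard_normal(20)
lam = 0.1                              # illustrative penalty weight

def softmax(w):
    e = np.exp(w - w.max())            # numerically stable softmax
    return e / e.sum()

def objective(w):
    x = softmax(w)                     # x lives on the simplex by construction
    return np.sum((A @ x - b) ** 2) + lam * np.sum(w ** 2)

res = minimize(objective, np.zeros(n), method="L-BFGS-B")
print(np.round(softmax(res.x), 4))     # shrinking w toward 0 pushes x toward 1/n
```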
jrennie
  • 394
  • 1
  • 5
  • Can you explain how your reparametrization encourage sparsity? It rather seems to *guarantee* quite the opposite. – cardinal Aug 24 '12 at 17:43
  • It encourages sparsity in $\vec w$ which corresponds to encouraging different entries of $\vec x$ to have the same value. – jrennie Aug 24 '12 at 17:56
  • Yes, I understand that. But, those values will not be zero. If we take the OP literally, this won't help and will actually "hurt" (in a sense). But, it is possible the OP is interested in sparsity with respect to some other basis, in which case, this would be one of them. :) – cardinal Aug 24 '12 at 18:01
  • That's why I provided two options in my answer---I think nonconvex penalty would be required to encourage zeros in $\vec x$. As you noted, Justin likely does not mean literally what he said. – jrennie Aug 24 '12 at 18:17
  • Yes, unfortunately we need sparsity in the identity basis. So in this case we'd want as many $w_i$'s as possible to equal $-\infty$. – Justin Solomon Aug 27 '12 at 21:43
1

The premise of the question is only partly correct. While it is true that the $L_1$-norm is just a constant under the constraint, the constrained optimization problem might very well have a sparse solution.

However, the solution is unaffected by the choice of $\lambda$, so either the constrained solution is sparse or it is not. Another question is how to actually find the solution. A standard quadratic optimizer under linear constraints can, of course, be used, but popular coordinate descent algorithms cannot be used out of the box.

One suggestion could be to optimize under a positivity constraint only, for different $\lambda$'s, and then renormalize the solution to have $L_1$-norm 1. A coordinate descent algorithm should, I believe, be easily modifiable to compute the solution under a positivity constraint.
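For what it's worth, a sketch of this suggestion is straightforward with scikit-learn's coordinate-descent lasso, which already accepts a positivity constraint via `positive=True` (placeholder data; the grid of $\lambda$'s is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 10))       # placeholder data
b = rng.standard_normal(20)

for lam in (0.01, 0.05, 0.1):
    model = Lasso(alpha=lam, positive=True, fit_intercept=False, max_iter=10_000)
    model.fit(A, b)
    x = model.coef_
    if x.sum() > 0:
        x = x / x.sum()                 # renormalize onto the simplex
    print(lam, np.round(x, 3))          # zeros from the lasso survive the rescaling
```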

NRH
  • 16,580
  • 56
  • 68
0

I can think of three methods.

  • Bayesian method: introduce a zero-mean prior distribution and use type II maximum likelihood to estimate the parameters and hyperparameters.

  • Use $\Vert\cdot\Vert_{\infty}$ as the regularizer instead. This is not differentiable, though; you can use a high-order norm to approximate it.

  • Use $-\sum_{i=1}^{n}\log x_i$.

In fact, the first and the third methods are the same.

Han Zhang
  • 1
  • 1