
Consider the elastic net problem

$$\min_x f(x) + \lambda_1 \Vert x \Vert_1 + \lambda_2 \Vert x \Vert_2^2$$

Is there a shrinkage operator for this objective function, similar to the soft thresholding operator for L1 regularization (which in this case would be $\text{sgn}(x) (\vert x \vert - \lambda_1)_+$)?

To elaborate:

In glmnet, for linear regression, given a design matrix $X \in \mathbb{R}^{n \times p}$ and a vector of observations $y \in \mathbb{R}^n$, the elastic net problem is of the form

$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2n}\Vert y - X \beta \Vert_2^2 + \alpha \lambda \Vert \beta \Vert_1 + (1-\alpha)\frac{\lambda}{2}\Vert \beta \Vert_2^2$$

where $\alpha \in [0,1]$ is the elastic net parameter. When $\alpha = 1$ this is clearly equivalent to lasso linear regression, in which case the proximal operator for L1 regularization is soft thresholding, i.e.

$$\text{prox}_{\lambda \Vert \cdot \Vert_1}(v) = \text{sgn}(v)(\vert v \vert - \lambda)_+$$
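For concreteness, here is that operator as a minimal NumPy sketch (the function name is mine; it is just the formula above, applied elementwise):

```python
import numpy as np

def soft_threshold(v, lam):
    # prox of lam * ||.||_1: sgn(v) * (|v| - lam)_+, applied elementwise
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Entries with |v_i| <= lam are set exactly to zero; the rest shrink toward 0.
print(soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0))  # [ 2. -0.  0.2]
```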

My question is: When $\alpha \in [0,1)$, what is the form of $\text{prox}_{\alpha\lambda\Vert\cdot\Vert_1 + (1-\alpha)\frac{\lambda}{2}\Vert \cdot\Vert_2^2}$ ?

user3294195
  • I think the question and notation are not clear to me. Do you mean `sign` by `sgn`? Also, the parentheses do not seem to be in pairs ... – Haitao Du Sep 25 '16 at 01:37
  • @hxd1011 "sgn" is widely used (see e.g. https://en.wikipedia.org/wiki/Sign_function) and the parentheses appear to be in pairs to me. – mark999 Sep 25 '16 at 01:49
  • Yes `sgn = sign` – user3294195 Sep 25 '16 at 01:49
  • Sorry, my bad ... but I still do not understand: what is $\text{sgn}(x) (\vert x \vert - \lambda_1)_+$? – Haitao Du Sep 25 '16 at 01:50
  • That's the soft-thresholding operator for L1 regularization. Since $\Vert x \Vert_1$ is not differentiable with respect to $x$, we can first solve $\min_x f(x)$ and then use the soft thresholding operator to set the elements in $x$ that have absolute value less than $\lambda$ to 0 (in an iterative procedure). My question is: Is there an equivalent shrinkage operator for elastic net regularization? – user3294195 Sep 25 '16 at 01:54
  • I do not understand "first solve min f(x) then use soft thresholding ..."; we can solve it directly, see this [post](http://stats.stackexchange.com/questions/228763/regularization-methods-for-logistic-regression/228785#228785). We can directly write the derivative of $\|x\|$ as sign(x), and most solvers [work](http://stackoverflow.com/questions/39071824/is-there-any-way-to-extract-parameters-and-objective-function-for-each-iteration) – Haitao Du Sep 25 '16 at 04:08
  • @hxd1011 I am not certain, but I believe the OP is effectively referring to the combination of [this](http://math.stackexchange.com/questions/416099/lasso-constraint-form-equivalent-to-penalty-form) and [this](http://math.stackexchange.com/questions/571068/what-is-the-difference-between-projected-gradient-descent-and-ordinary-gradient). – GeoMatt22 Sep 25 '16 at 05:06
  • @hxd The $(\cdot)_+$ operator simply returns its argument if it's positive and is 0 otherwise. – Glen_b Sep 25 '16 at 06:51
  • @GeoMatt22 my question is related to the second link in your comment. I've added more detail, in case the original question was unclear. – user3294195 Sep 25 '16 at 17:42
  • You'll likely be interested in [this monograph](https://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf). In particular, see **page 189** (as paginated on the page; it's page 71 of the pdf) where it states $$\mathrm{prox}_{\lambda f} (v) = \left(\frac{1}{1+\gamma \lambda} \right)\mathrm{prox}_{\lambda \|\cdot\|_1}(v)$$ where $\gamma$ is a slightly different parameterization than the one you're considering. – cardinal Sep 25 '16 at 17:58
  • Thanks - I'm familiar with this paper, but for some reason, the proximal operator for elastic net defined in this way does not replicate the results of `glmnet` linear regression. It does match the function I've defined in this question though: http://stats.stackexchange.com/questions/236866/replicating-results-for-glmnet-linear-regression-using-a-generic-optimizer – user3294195 Sep 25 '16 at 22:29

1 Answer


The two tutorials Ryu and Boyd (2016) and Parikh and Boyd (2013) give a nice overview of operator splitting methods built from the gradient operator and the proximal operator. The proximal operator is a special case of the resolvent operator.

The resolvent of a relation $A$ on $\mathbb{R}^n$ is defined as $$ R_{\lambda A} = (I + \lambda A)^{-1} $$

where $I$ is the identity map.

The proximal operator of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is defined as $$ \text{prox}_{\lambda f}(x) = \arg\min_z \left( \lambda f(z) + \frac{1}{2}\|z-x\|_2^2 \right) $$

Here we consider only convex $f$, so the minimizer above is unique. We can show that $\text{prox}_{\lambda f}(x) = R_{\lambda \partial f}(x)$: let $z = \text{prox}_{\lambda f}(x)$. Then \begin{align} z &= \arg \min_u \left( \lambda f(u) + \frac{1}{2} \|u-x\|_2^2 \right) \\ 0 &\in \lambda\partial f(z) + z - x \\ x &\in \lambda\partial f(z) + z \\ z &= (I + \lambda \partial f)^{-1}(x) \end{align}
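As a quick numerical sanity check of this identity (my own sketch, not from the tutorials): in one dimension with $f = |\cdot|$, minimizing the prox objective directly should reproduce soft thresholding.

```python
import numpy as np
from scipy.optimize import minimize_scalar

lam, x = 0.7, 1.5
# prox_{lam f}(x) by direct minimization of lam*|z| + (1/2)(z - x)^2 ...
numerical = minimize_scalar(lambda z: lam * abs(z) + 0.5 * (z - x) ** 2).x
# ... versus the closed-form resolvent (I + lam * d|.|)^{-1}(x), i.e. the soft threshold
closed_form = np.sign(x) * max(abs(x) - lam, 0.0)
print(numerical, closed_form)  # both approximately 0.8
```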

A nice property of the proximal operator is that the proximal point iteration $x^{k+1} = \text{prox}_{\lambda f}(x^k)$ converges linearly when $f$ is strongly convex. For $\alpha \neq 1$, the ridge term makes the elastic net objective strongly convex, with parameter at least $(1-\alpha)\lambda$.

Applying the same trick to the elastic net objective, let $z = \text{prox}_{\text{elastic}}(\beta)$: \begin{align} z &= \arg \min_u \left( \frac{1}{2n}\|Y - Xu\|_2^2 + \alpha \lambda\|u\|_1 + (1-\alpha)\frac{\lambda}{2} \|u\|_2^2 + \frac{1}{2}\|u -\beta\|_2^2 \right) \\ 0 &\in \frac{X^TXz - X^TY}{n} + \alpha\lambda\, \text{sgn}(z) + (1-\alpha)\lambda z + z - \beta \end{align} where $\text{sgn}$ is understood elementwise, with $\text{sgn}(0) = [-1,1]$ (the subdifferential of $|\cdot|$).

There is no closed form for $z$ unless we impose an orthogonal design, $X^TX = I$. Then the first-order condition separates across coordinates: \begin{align} 0 &\in z_i/n - (X^TY)_i/n + \alpha \lambda\, \text{sgn}(z_i) + (1-\alpha)\lambda z_i + z_i -\beta_i \\ \beta_i + (X^TY)_i/n - \left((1-\alpha)\lambda + 1 + 1/n\right)z_i &\in \alpha\lambda\, \text{sgn}(z_i) \end{align} Let $t_i = \frac{\beta_i + (X^TY)_i/n}{(1-\alpha)\lambda + 1 + 1/n}$; then $$ z_i = \text{sgn}(t_i)\left(|t_i| - \frac{\alpha \lambda}{(1-\alpha)\lambda + 1 + 1/n}\right)_+ $$
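A direct NumPy transcription of this coordinate-wise formula (the function name is mine; it is only valid under the orthogonal design assumption $X^TX = I$):

```python
import numpy as np

def prox_elastic_objective_orthogonal(beta, X, Y, alpha, lam):
    # One prox step on the full elastic net objective, assuming X^T X = I.
    n = X.shape[0]
    c = (1.0 - alpha) * lam + 1.0 + 1.0 / n       # (1-alpha)*lam + 1 + 1/n
    t = (beta + X.T @ Y / n) / c                  # t_i from the derivation above
    return np.sign(t) * np.maximum(np.abs(t) - alpha * lam / c, 0.0)
```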

Iteratively applying the proximal operator starting from $\beta_{\text{init}}$ yields $\beta^*$ with $\beta^* = \text{prox}_{\text{elastic}}(\beta^*)$, which implies $0 \in \partial f(\beta^*)$, i.e. $\beta^*$ minimizes the objective.
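A sketch of that fixed-point iteration on synthetic data, reusing `prox_elastic_objective_orthogonal` from the previous sketch (the data, tolerance, and iteration cap are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns: X^T X = I
Y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(n)
alpha, lam = 0.5, 0.1

beta = np.zeros(p)                                  # beta_init
for _ in range(1000):
    beta_next = prox_elastic_objective_orthogonal(beta, X, Y, alpha, lam)
    if np.linalg.norm(beta_next - beta) < 1e-12:    # reached beta* = prox(beta*)
        break
    beta = beta_next
print(beta)
```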

$\text{prox}_{\alpha\lambda\|\cdot\|_1 + (1-\alpha)\frac{\lambda}{2} \| \cdot\|_2^2}$ (the penalty alone, as in the question) is even easier to compute: \begin{align*} \beta_i - \left((1-\alpha)\lambda + 1\right)z_i \in \alpha \lambda\, \text{sgn}(z_i) \end{align*} Again, let $t_i = \frac{\beta_i}{(1-\alpha)\lambda + 1}$; then $$z_i = \text{sgn}(t_i)\left(|t_i| -\frac{\alpha\lambda}{(1-\alpha)\lambda + 1}\right)_+$$ This agrees with the formula from Parikh and Boyd quoted in the comments: it is the soft threshold $\text{sgn}(\beta_i)(|\beta_i| - \alpha\lambda)_+$ scaled by $\frac{1}{1 + (1-\alpha)\lambda}$.
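In code, the operator the question asks about is thus a soft threshold followed by a multiplicative shrinkage; a minimal sketch (naming is mine):

```python
import numpy as np

def prox_elastic_net_penalty(beta, alpha, lam):
    # prox of alpha*lam*||.||_1 + (1-alpha)*(lam/2)*||.||_2^2:
    # equivalently, soft threshold at alpha*lam, then scale by 1/(1+(1-alpha)*lam).
    c = (1.0 - alpha) * lam + 1.0
    t = beta / c
    return np.sign(t) * np.maximum(np.abs(t) - alpha * lam / c, 0.0)
```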

Please let me know if I made any algebraic mistakes. Good luck!

Yang Guo
  • How would you solve this similar problem - https://math.stackexchange.com/questions/2595199? – Royi Mar 13 '18 at 15:59