It requires some algebra to derive the result, but it's actually not complicated.
Derivative of vector norm
First, we need to remember the derivative of the vector norm. Let's start with the non-zero case $\mathbf{x} \neq \mathbf{0}$:
$$
\frac{d}{d \mathbf{x}} \lVert \mathbf{x} \rVert =
\frac{\mathbf{x}}{ \lVert \mathbf{x}\rVert } .
$$
The vector norm is not differentiable at zero, so for $\mathbf{x} = \mathbf{0}$ we need to follow a different path: we consider the subdifferential $\partial(\lVert \mathbf{x}\rVert)$, which has the form
\begin{align}
\partial(\lVert \mathbf{x}\rVert)
&= \{ \mathbf{v} \in \mathbb{R}^{p} \mid
\lVert\mathbf{x}^\prime\rVert \geq \lVert\mathbf{x}\rVert + \mathbf{v}^\top (\mathbf{x}^\prime - \mathbf{x}), \forall \mathbf{x}^\prime \in \mathbb{R}^{p} \} \\
&= \{ \mathbf{v} \in \mathbb{R}^{p} \mid
\lVert \mathbf{x}^\prime \rVert \geq \mathbf{v}^\top \mathbf{x}^\prime, \forall \mathbf{x}^\prime \in \mathbb{R}^{p} \} \\
&= \{ \mathbf{v} \in \mathbb{R}^{p} \mid
\lVert \mathbf{v} \rVert \leq 1 \}
\end{align}
Thus, a subgradient vector $\mathbf{v}$ of $\lVert \mathbf{x} \rVert$ at $\mathbf{x} = \mathbf{0}$ needs to satisfy $\lVert \mathbf{v} \rVert \leq 1$.
Putting both cases together, the subdifferential of the norm is
$$
\partial ( \lVert \mathbf{x} \rVert ) =
\begin{cases}
\{ \mathbf{x} / \lVert \mathbf{x} \rVert \} & \text{if $\mathbf{x} \neq \mathbf{0}$} , \\
\{ \mathbf{v} \in \mathbb{R}^{p} \mid
\lVert \mathbf{v} \rVert \leq 1 \}
& \text{if $\mathbf{x} = \mathbf{0}$} .
\end{cases}
$$
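As a quick sanity check, the gradient $\mathbf{x} / \lVert \mathbf{x} \rVert$ for non-zero $\mathbf{x}$ can be compared against finite differences. The following is a minimal NumPy sketch of my own; the helper names are not from any particular library.

```python
import numpy as np

def norm_grad(x):
    """Gradient of the Euclidean norm at a non-zero point x: x / ||x||."""
    return x / np.linalg.norm(x)

def finite_diff_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.5, -1.2, 2.0])
print(norm_grad(x))
print(finite_diff_grad(np.linalg.norm, x))  # should closely match norm_grad(x)
```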
SCAD-penalized linear regression
The objective function we want to minimize, one group $\mathbf{\beta}_j$ at a time in a block coordinate descent fashion, is
\begin{align}
\arg\min_{\mathbf{\beta}_j} &\quad Q(\mathbf{\beta}) \\
Q(\mathbf{\beta}) &=
\frac{1}{2} (\mathbf{y} - \mathbf{X} \mathbf{\beta})^\top
(\mathbf{y} - \mathbf{X} \mathbf{\beta}) + \sum_{j=1}^J p_{\lambda,\gamma}( \lVert{ \mathbf{\beta}_j }\rVert ) ,
\end{align}
where $J$ is the number of groups, $\mathbf{\beta}_j$ denotes the coefficients of the $j$-th group, and $\lambda > 0$ controls the amount of penalization.
The SCAD penalty is defined as
$$
p_{\lambda,\gamma}(u) = \begin{cases}
\lambda u & \text{if $0 \leq u \leq \lambda$}, \\
- \frac{u^2 - 2 \gamma \lambda u + \lambda^2}{2(\gamma-1)}
& \text{if $\lambda < u \leq \gamma \lambda$}, \\
\frac{(\gamma+1)\lambda^2}{2} & \text{if $u > \gamma \lambda$} .
\end{cases}
$$
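For reference, here is a direct transcription of this three-branch penalty as a vectorized NumPy function. This is my own illustrative sketch; it assumes $u \geq 0$ and $\gamma > 2$.

```python
import numpy as np

def scad_penalty(u, lam, gamma):
    """SCAD penalty p_{lambda,gamma}(u) for u >= 0, with gamma > 2."""
    u = np.asarray(u, dtype=float)
    linear = lam * u                                                        # 0 <= u <= lambda
    quadratic = (2 * gamma * lam * u - u**2 - lam**2) / (2 * (gamma - 1))   # lambda < u <= gamma * lambda
    constant = (gamma + 1) * lam**2 / 2                                     # u > gamma * lambda
    return np.where(u <= lam, linear,
                    np.where(u <= gamma * lam, quadratic, constant))
```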
Most papers, such as Breheny and Huang (2015), assume orthonormal groups, i.e., $\left( \mathbf{X}^j \right)^\top \mathbf{X}^j = \mathbf{I}$, which I will assume here as well.
Case 1 – LASSO
Let's consider the first case, which corresponds to the LASSO. I use $\mathbf{X}^{j}$ to denote the columns of $\mathbf{X}$ corresponding to the
$j$-th group, and $\mathbf{X}^{-j}$ to denote all columns not belonging to the $j$-th group. Setting the derivative to zero, we obtain
\begin{align}
\frac{ \partial Q(\beta) }{\partial \beta_j }
&= -(\mathbf{X}^{j})^\top (\mathbf{y} - \mathbf{X}^{-j} \beta_{-j} - \mathbf{X}^{j} \beta_{j})
+ \lambda \partial(\lVert \beta_j \rVert) \\
&= -(\mathbf{X}^{j})^\top (\mathbf{y} - \mathbf{X}^{-j} \beta_{-j}) + (\mathbf{X}^{j})^\top \mathbf{X}^{j} \beta_{j}
+ \lambda \partial(\lVert \beta_j \rVert) \\
&= -(\mathbf{X}^{j})^\top (\mathbf{y} - \mathbf{X}^{-j} \beta_{-j}) + \mathbf{I} \beta_{j}
+ \lambda \partial(\lVert \beta_j \rVert) \\
&= - \mathbf{z}_j + \mathbf{I} \beta_{j}
+ \lambda \partial(\lVert \beta_j \rVert) \\
&= \mathbf{0} ,
\end{align}
where I made use of the orthonormality assumption and set $\mathbf{z}_j = (\mathbf{X}^{j})^\top (\mathbf{y} - \mathbf{X}^{-j} \beta_{-j})$, which is the part not depending on $\beta_j$.
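To make the notation concrete, here is a minimal NumPy sketch of how $\mathbf{z}_j$ could be computed. The list `groups` (column-index arrays) and the helper name are my own assumptions; the identity $\mathbf{y} - \mathbf{X}^{-j} \beta_{-j} = \mathbf{y} - \mathbf{X}\beta + \mathbf{X}^{j}\beta_{j}$ is used to avoid slicing out $\mathbf{X}^{-j}$ explicitly.

```python
import numpy as np

def partial_residual_projection(X, y, beta, groups, j):
    """Compute z_j = (X^j)^T (y - X^{-j} beta_{-j}) for the j-th group."""
    idx_j = groups[j]
    # partial residual with the j-th group's own contribution added back
    r_minus_j = y - X @ beta + X[:, idx_j] @ beta[idx_j]
    return X[:, idx_j].T @ r_minus_j
```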
If $ \beta_j \neq \mathbf{0}$, we substitute the derivative from above and obtain
$$
- \mathbf{z}_j + \beta_{j}
+ \lambda \frac{\beta_j}{\lVert \beta_j \rVert}
= \mathbf{0} ,
$$
which is equivalent to
$$
\beta_{j} = \mathbf{z}_j \left( 1 + \frac{\lambda}{\lVert \beta_j \rVert} \right)^{-1} .
$$
Now, we need to eliminate $\lVert \beta_j \rVert$ on the right-hand side. Taking the norm on both sides gives
$$
\lVert \beta_{j} \rVert
= \lVert \mathbf{z}_j \rVert \left( 1 + \frac{\lambda}{\lVert \beta_j \rVert} \right)^{-1} ,
$$
and multiplying both sides by $1 + \lambda / \lVert \beta_j \rVert$ yields $\lVert \beta_j \rVert + \lambda = \lVert \mathbf{z}_j \rVert$, i.e.,
$$
\lVert \beta_{j} \rVert = \lVert \mathbf{z}_j \rVert - \lambda ;
$$
the same step is explained in more detail in another question on the group LASSO.
Substituting this result in the equation from above, we get
$$
\beta_{j} = \mathbf{z}_j \left( 1 + \frac{\lambda}{\lVert \mathbf{z}_j \rVert - \lambda} \right)^{-1}
= \frac{\mathbf{z}_j}{\lVert \mathbf{z}_j \rVert} \left( \lVert \mathbf{z}_j \rVert - \lambda \right) .
$$
For the case $ \beta_j = \mathbf{0}$, we need to consider the subdifferential and the Karush–Kuhn–Tucker (KKT) conditions, which state
$$
\mathbf{0} \in -\mathbf{z}_{j} + \mathbf{\beta}_j
+ \lambda \mathbf{v} ,
$$
where $\mathbf{v}$ is any vector with $\lVert \mathbf{v} \rVert \leq 1$.
Since $\mathbf{\beta}_j = \mathbf{0}$ forces $\mathbf{v} = \frac{\mathbf{z}_{j}}{\lambda}$, the requirement
$\lVert \mathbf{v} \rVert \leq 1$ is equivalent to $\lVert \mathbf{z}_j \rVert \leq \lambda$.
Thus, the LASSO solution is
$$
\mathbf{\beta}_j = \begin{cases}
\frac{\mathbf{z}_j}{\lVert \mathbf{z}_j \rVert} \left( \lVert \mathbf{z}_j \rVert - \lambda \right)
& \text{if $\lVert \mathbf{z}_j \rVert > \lambda$}, \\
\mathbf{0}
& \text{if $\lVert \mathbf{z}_j \rVert \leq \lambda$}.
\end{cases}
$$
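This is exactly group soft-thresholding of $\mathbf{z}_j$: shrink its norm by $\lambda$ and set the whole group to zero if the norm falls below $\lambda$. A minimal sketch (the function name is mine):

```python
import numpy as np

def group_soft_threshold(z_j, lam):
    """Group LASSO update: shrink z_j towards zero by lambda in norm."""
    norm_z = np.linalg.norm(z_j)
    if norm_z <= lam:
        return np.zeros_like(z_j)
    return (z_j / norm_z) * (norm_z - lam)
```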
Case 2
For the second case of the SCAD penalty, the subdifferential is
$$
\partial( p_{\lambda, \gamma} ( \lVert \beta_j \rVert)) = - \frac{\beta_j - \gamma \lambda \partial(\lVert\beta_j\rVert)}{\gamma - 1} .
$$
Again, we need to distinguish between the non-zero and zero case and essentially repeat the same steps as for case 1. Let's begin with
$ \beta_j \neq \mathbf{0}$, for which the partial derivative is
$$
\frac{ \partial Q(\beta) }{\partial \beta_j }
= - \mathbf{z}_j + \beta_{j}
- \frac{\beta_j - \gamma \lambda \frac{\beta_j}{\lVert\beta_j\rVert}}{\gamma - 1}
= \mathbf{0} .
$$
Solving for $\beta_j$ and then taking the norm on both sides, as in case 1, leads to
\begin{align}
\beta_j &= \mathbf{z}_j
\left( 1 - \frac{1 - \gamma\lambda \frac{1}{\lVert\beta_j\rVert} }{\gamma - 1} \right)^{-1}
= \mathbf{z}_j
\frac{\gamma - 1}
{\gamma - 2 + \gamma\lambda \frac{1}{\lVert\beta_j\rVert}} , \\
\lVert \beta_j \rVert &=
\left( \lVert{ \mathbf{z}_j }\rVert - \frac{\gamma \lambda}{\gamma - 1} \right)
\frac{\gamma -1}{\gamma -2}
= \frac{ \lVert{ \mathbf{z}_j }\rVert (\gamma - 1) -\gamma \lambda }{\gamma - 2} .
\end{align}
Substituting the vector norm into the equation above, we obtain
\begin{align}
\beta_j &= \mathbf{z}_j
\frac{\gamma - 1}
{\gamma - 2 + \gamma\lambda \frac{\gamma - 2}{ \lVert{ \mathbf{z}_j }\rVert (\gamma - 1) -\gamma \lambda }} \\
&= \mathbf{z}_j
\frac{\gamma - 1}
{\frac{\gamma\lVert{ \mathbf{z}_j }\rVert (\gamma - 1) -\gamma^2 \lambda - 2\lVert{ \mathbf{z}_j }\rVert (\gamma - 1) +2\gamma \lambda + \gamma^2\lambda - 2 \gamma \lambda }{ \lVert{ \mathbf{z}_j }\rVert (\gamma - 1) -\gamma \lambda }} \\
&= \mathbf{z}_j
\left(
\frac{ \lVert{ \mathbf{z}_j }\rVert(\gamma - 1) - \gamma \lambda}
{\gamma \lVert{ \mathbf{z}_j }\rVert - 2\lVert{ \mathbf{z}_j }\rVert}
\right) \\
&= \frac{\mathbf{z}_j}{\lVert{ \mathbf{z}_j }\rVert}
\left(
\frac{ \lVert{ \mathbf{z}_j }\rVert(\gamma - 1) - \gamma \lambda}
{\gamma - 2}
\right) \\
&= \frac{\mathbf{z}_j}{\lVert{ \mathbf{z}_j }\rVert}
\left( \lVert{ \mathbf{z}_j }\rVert - \frac{\gamma \lambda}{\gamma - 1} \right)
\frac{\gamma -1}{\gamma -2} .
\end{align}
If $ \beta_j = \mathbf{0}$, we consider the subdifferential with $\lVert \mathbf{v} \rVert \leq 1$ and obtain the KKT condition
$$
\mathbf{0}
\in - \mathbf{z}_j
+ \frac{ \gamma \lambda \mathbf{v}}{\gamma - 1} ,
$$
from which $\mathbf{v} = \frac{(\gamma - 1) \mathbf{z}_j}{\gamma \lambda}$, so that
$$
\lVert \mathbf{v} \rVert \leq 1
\Leftrightarrow
\lVert \mathbf{z}_j \rVert \leq \frac{\gamma \lambda}{\gamma-1} .
$$
Putting both cases together, the solution for the second case is
$$
\mathbf{\beta}_j = \begin{cases}
\frac{\mathbf{z}_j}{\lVert{ \mathbf{z}_j }\rVert}
\left( \lVert{ \mathbf{z}_j }\rVert - \frac{\gamma \lambda}{\gamma - 1} \right)
\frac{\gamma -1}{\gamma -2}
& \text{if $\lVert \mathbf{z}_j \rVert > \frac{\gamma \lambda}{\gamma-1}$}, \\
\mathbf{0}
& \text{if $\lVert \mathbf{z}_j \rVert \leq \frac{\gamma \lambda}{\gamma-1}$}.
\end{cases}
$$
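In code, this middle case is again a soft-thresholding step, now at the threshold $\gamma\lambda/(\gamma-1)$ followed by a rescaling with $(\gamma-1)/(\gamma-2)$. A sketch of my own, assuming $\gamma > 2$:

```python
import numpy as np

def scad_middle_update(z_j, lam, gamma):
    """Case 2 update: soft-threshold at gamma*lam/(gamma-1), rescale by (gamma-1)/(gamma-2)."""
    norm_z = np.linalg.norm(z_j)
    threshold = gamma * lam / (gamma - 1)
    if norm_z <= threshold:
        return np.zeros_like(z_j)
    return (z_j / norm_z) * (norm_z - threshold) * (gamma - 1) / (gamma - 2)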
Case 3
Finally, the third case of the SCAD penalty corresponds to a constant penalty, i.e., no further shrinkage, and yields the solution
$$
\frac{ \partial Q(\beta) }{\partial \beta_j }
= - \mathbf{z}_j + \beta_{j}
= \mathbf{0}
\Leftrightarrow
\beta_{j} = \mathbf{z}_j .
$$
Final Solution
The points where the three cases are joined are identical to those in the univariate case. With the soft-thresholding operator $S(z, \lambda) = \operatorname{sign}(z) \left( |z| - \lambda \right)_{+}$, this brings us to the final solution of group SCAD-penalized linear regression:
\begin{align}
\beta_j &= \frac{\mathbf{z}_j}{\lVert{ \mathbf{z}_j }\rVert} F_{\lambda, \gamma} (\lVert{ \mathbf{z}_j }\rVert) \\
F_{\lambda, \gamma} (z) &= \begin{cases}
S(z, \lambda)
& \text{if $|z| \leq 2\lambda$}, \\
S\left( z, \frac{\gamma \lambda}{\gamma - 1} \right) \frac{\gamma - 1}{\gamma -2}
& \text{if $2 \lambda < |z| \leq \gamma \lambda$}, \\
z
& \text{if $|z| > \gamma \lambda$} .
\end{cases}
\end{align}
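To tie everything together, here is a minimal sketch of the resulting update and a naive block coordinate descent loop. This is my own illustration under the orthonormal-group assumption, with no convergence checks or standardization, and not the reference implementation of Breheny and Huang (2015).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def scad_threshold(z, lam, gamma):
    """Univariate SCAD thresholding F_{lambda,gamma}(z) from the final solution."""
    if np.abs(z) <= 2 * lam:
        return soft_threshold(z, lam)
    if np.abs(z) <= gamma * lam:
        return soft_threshold(z, gamma * lam / (gamma - 1)) * (gamma - 1) / (gamma - 2)
    return z

def group_scad_update(z_j, lam, gamma):
    """Group update: beta_j = z_j / ||z_j|| * F_{lambda,gamma}(||z_j||)."""
    norm_z = np.linalg.norm(z_j)
    if norm_z == 0:
        return np.zeros_like(z_j)
    return (z_j / norm_z) * scad_threshold(norm_z, lam, gamma)

def group_scad_coordinate_descent(X, y, groups, lam, gamma=3.7, n_iter=100):
    """Cycle over the groups and apply the closed-form update.

    gamma = 3.7 is the value commonly recommended for SCAD (Fan and Li, 2001).
    Assumes each group X[:, idx_j] has been orthonormalized.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for idx_j in groups:
            # z_j = (X^j)^T (y - X^{-j} beta_{-j}), adding back the group's own fit
            r_minus_j = y - X @ beta + X[:, idx_j] @ beta[idx_j]
            z_j = X[:, idx_j].T @ r_minus_j
            beta[idx_j] = group_scad_update(z_j, lam, gamma)
    return beta
```

Each pass applies the closed-form update derived above to every group in turn, so no line search or inner optimization is needed.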