5

I am estimating a regression equation via LASSO and want to avoid penalizing certain coefficients for inference purposes. I have seen some discussion of differential shrinking/penalty terms (see related: Lasso penalty only applied to subset of regressors).

However, the LASSO implementation I am using makes it difficult to specify a different penalty per feature (unlike, say, glmnet). I was inspired by the observation that we need to standardize features before running LASSO because the scale of a variable affects its implicit penalty. It seems, then, that on a standardized set of variables we can manipulate the penalty by rescaling only certain variables.

For example, suppose we have two variables $x_1$ and $x_2$ standardized to mean 0, variance 1. We are using a common penalty $\lambda = 0.05$.

We don’t want to penalize $x_1$ as much, so we multiply $x_1$ by 100. My intuition tells me this works out to an implicit penalty of $\lambda_1 = \lambda/100 = 0.0005$ on $x_1$. The penalty on $x_2$ is still $\lambda_2 = \lambda$.
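My reasoning, spelled out (in case it helps): the rescaled fit uses $\tilde{x}_1 = 100\, x_1$ with coefficient $\tilde{\beta}_1$, and since $\tilde{x}_1 \tilde{\beta}_1 = x_1 \beta_1$ we must have $\tilde{\beta}_1 = \beta_1 / 100$. The penalty actually paid on this term is then

$$ \lambda\, |\tilde{\beta}_1| = \frac{\lambda}{100}\, |\beta_1| = 0.0005\, |\beta_1|, $$

i.e. an implicit penalty of $\lambda/100$ on the original-scale coefficient $\beta_1$.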

I am not sure how to approach solving for the correct way to rescale, so I was wondering if anyone has considered pursuing a solution like this. Any assistance greatly appreciated.

user2112

1 Answer

6

To simplify the discussion below, I will first consider the case that all $\lambda_i > 0$, and then show how to deal with some unpenalized predictors.

Part 1: All predictors are penalized ($\lambda_i > 0$ for all $i$)

This case indeed works in exactly the way you described in your question.

Let $\Lambda = \text{Diag}(\lambda_1,\dotsc, \lambda_p)$ be the diagonal matrix where $\lambda_i$ is the penalty you want applied to the $i$-th predictor.

Then you can write the LASSO problem (design matrix $X$, response $Y$) as follows:

$$ \min_{\beta \in \mathbb R^p} || Y - X\beta||^2_2 + || \Lambda \beta ||_1 $$

Now note that multiplying $X$ by $\Lambda^{-1}$ from the right multiplies the $i$-th column by $1/\lambda_i$, and observe:

$$ ||Y - X\beta||_2^2 + || \Lambda \beta||_1 = ||Y - X\Lambda^{-1} \Lambda\beta||_2^2 + || \Lambda \beta||_1= ||Y - X\Lambda^{-1} \tilde{\beta}||_2^2 + || \tilde{\beta}||_1 $$

In the last step I defined $\tilde{\beta}= \Lambda \beta$. Hence the original LASSO problem must be equivalent to:

$$ \min_{\tilde{\beta} \in \mathbb R^p} || Y - X\Lambda^{-1} \tilde{\beta}||^2_2 + || \tilde{\beta} ||_1 $$

This is a LASSO problem in which every predictor gets a penalty of $1$. It is trivial to extend this so that every predictor gets a penalty of $\lambda$ (the entries of $\Lambda$ then represent the relative rather than the absolute penalization of the $i$-th predictor); you might want this if you want to nest the above within cross-validation-based tuning of the regularization parameter.

When you predict afterwards you need to remember what scaling you used though! E.g. if you use the original $X$, then use $\beta = \Lambda^{-1} \tilde{\beta}$!
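
To make Part 1 concrete, here is a minimal sketch in Python using scikit-learn (not the solver from the question; the weights `lam`, the global `alpha`, and the toy data are hypothetical choices, and scikit-learn's objective carries an extra $1/(2n)$ factor, which rescales the overall penalty but not the relative weighting):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# Relative penalty weights lambda_i (hypothetical choice): penalize the
# first predictor only 1/100 as much as the others.
lam = np.array([0.01, 1.0, 1.0, 1.0, 1.0])

# Rescale: the i-th column of X_tilde is X[:, i] / lambda_i,
# i.e. X_tilde = X @ Lambda^{-1}.
X_tilde = X / lam

# Solve an ordinary LASSO in beta_tilde = Lambda @ beta.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X_tilde, y)
beta_tilde = fit.coef_

# Undo the rescaling to recover coefficients on the original X.
beta_hat = beta_tilde / lam
print(beta_hat)
```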

Part 2: Some unpenalized predictors (i.e. some $\lambda_i = 0$)

Let's say you now want to solve a LASSO problem in which some predictors, let's call them $Z$, are not penalized, i.e.:

$$ \min_{\beta, \gamma} || Y - X\beta - Z\gamma||^2_2 + || \Lambda \beta ||_1 $$

(Here I just split the full design matrix into two parts $X$ and $Z$ corresponding to penalized or unpenalized predictors.)

If your LASSO solver does not support unpenalized predictors, then, as you mention in your comment, you could just use the technique from Part 1 with a $\lambda_i$ very close to $0$ for the unpenalized predictors. This would roughly work, but it is bad from a numerical perspective, since some columns of $X\Lambda^{-1}$ would blow up.

Instead, there is a better way to do this by orthogonalization. You could proceed in the following steps:

  1. Regress $Y \sim Z$, call the resulting coefficient $\tilde{\gamma}$ and let $\tilde{Y}$ be the residuals from this regression (i.e. $\tilde{Y} = Y - Z\tilde{\gamma}$).
  2. Regress $X \sim Z$: For each column of $X$, say the $i$-th column, run the regression $X_i \sim Z$. Then call $\tilde{X}$ the design matrix whose $i$-th column is the residual from the $i$-th regression.
  3. Run the following LASSO to get the fitted coefficient $\hat{\beta}$ (For this you will need the technique from Part 1.):

$$ \min_{\beta \in \mathbb R^p} || \tilde{Y} - \tilde{X}\beta||^2_2 + || \Lambda \beta ||_1 $$

  4. Finally, let $\hat{\gamma} = \tilde{\gamma} - (Z^TZ)^{-1}Z^TX\hat{\beta}$.

Then $(\hat{\beta}, \hat{\gamma})$ will be the solutions to the full LASSO problem.
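
Here is a minimal Python sketch of those four steps, under the same assumptions as the Part 1 sketch (scikit-learn solver, hypothetical `alpha`, weights and toy data), additionally assuming $Z$ has full column rank:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, q = 200, 4, 2
X = rng.standard_normal((n, p))            # penalized predictors
Z = np.column_stack([np.ones(n),           # unpenalized predictors
                     rng.standard_normal(n)])
y = (X @ np.array([2.0, 0.0, 0.0, -1.0])
     + Z @ np.array([0.5, 1.5])
     + rng.standard_normal(n))

lam = np.array([1.0, 1.0, 1.0, 0.1])       # relative penalties for columns of X

# Step 1: regress y on Z; keep the coefficients and the residuals.
gamma_tilde, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_res = y - Z @ gamma_tilde

# Step 2: regress each column of X on Z; keep the residual matrix.
coef_XZ, *_ = np.linalg.lstsq(Z, X, rcond=None)   # shape (q, p), = (Z'Z)^{-1} Z'X
X_res = X - Z @ coef_XZ

# Step 3: LASSO on the residuals, using the Part 1 rescaling trick.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X_res / lam, y_res)
beta_hat = fit.coef_ / lam

# Step 4: back out the coefficients of the unpenalized predictors.
gamma_hat = gamma_tilde - coef_XZ @ beta_hat

print(beta_hat, gamma_hat)
```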

Why does this work? This is a standard orthogonalization argument, similar to how the QR decomposition can be used to do linear regression. Essentially orthogonality (via the Pythagorean theorem -- I leave out the exact arguments as they are standard) allows us to split as follows (with $\hat{Y} = Y-\tilde{Y}$, $\hat{X} = X-\tilde{X}$):

$$ || Y-X\beta - Z\gamma||^2 = ||\tilde{Y} - \tilde{X}\beta||^2 + || \hat{Y} - \hat{X}\beta - Z\gamma||^2$$

So we want to solve:

$$ \min_{\beta, \gamma} \{ ||\tilde{Y} - \tilde{X}\beta||^2 + || \hat{Y} - \hat{X}\beta - Z\gamma||^2 + || \Lambda \beta ||_1 \}$$

Now if we optimize over $\gamma$ for fixed $\beta$, we obtain the expression from step 4 of the above procedure, and moreover the second squared term vanishes entirely, so the $\beta$ appearing in it drops out. What remains is only the LASSO from step 3. Putting everything together gives us the procedure outlined above.
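
To spell out that $\gamma$-step (using the fitted parts $\hat{Y} = Z\tilde{\gamma}$ and $\hat{X} = Z(Z^TZ)^{-1}Z^TX$ from steps 1 and 2):

$$ \arg\min_{\gamma} \; \| \hat{Y} - \hat{X}\beta - Z\gamma \|^2 = (Z^TZ)^{-1}Z^T\big(\hat{Y} - \hat{X}\beta\big) = \tilde{\gamma} - (Z^TZ)^{-1}Z^TX\beta, $$

and because $\hat{Y} - \hat{X}\beta$ lies in the column space of $Z$, this choice drives the second squared term to $0$ for every $\beta$. Minimizing what is left over $\beta$ is exactly the LASSO in step 3, and plugging $\hat{\beta}$ into the display above gives the formula for $\hat{\gamma}$ in step 4.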

air
  • Thanks for breaking it down, great answer! One thing I'm not sure about is whether there's a way to deal with this if some lambdas are zero. I suppose you could just use a really big 1/lambda as a weight, but am I correct in thinking that the weighting approach can't work for zero lambdas? – user2112 Feb 03 '18 at 20:03
  • @user2112 updated with the case of some lambdas equal to zero! – air Feb 12 '18 at 04:49