5

I am estimating a regression equation via LASSO and want to avoid penalizing certain coefficients for inference purposes. I have seen some discussion of differential shrinking/penalty terms (see related: Lasso penalty only applied to subset of regressors).

However, the LASSO implementation I am using makes it difficult to specify a different penalty per feature (unlike, say, glmnet). I was inspired by the observation that we need to standardize features before running LASSO because the scale of a variable affects its implicit penalty. It seems, then, that on a standardized set of variables we can manipulate the penalty by rescaling only certain variables.

For example, suppose we have two variables $x_1$ and $x_2$ standardized to mean 0, variance 1. We are using a common penalty $\lambda = 0.05$.

We don’t want to penalize $x_1$ as much, so we multiply $x_1$ by 100. My intuition tells me this works out to an implicit penalty of $\lambda_1 = \lambda/100 = 0.0005$ on $x_1$. The penalty on $x_2$ is still $\lambda_2 = \lambda$.
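My reasoning, spelled out (in case it helps): the rescaled fit uses $\tilde{x}_1 = 100\, x_1$ with coefficient $\tilde{\beta}_1$, and since $\tilde{x}_1 \tilde{\beta}_1 = x_1 \beta_1$ we must have $\tilde{\beta}_1 = \beta_1 / 100$. The penalty actually paid on this term is then

$$ \lambda\, |\tilde{\beta}_1| = \frac{\lambda}{100}\, |\beta_1| = 0.0005\, |\beta_1|, $$

i.e. an implicit penalty of $\lambda/100$ on the original-scale coefficient $\beta_1$.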

I am not sure how to approach solving for the correct way to rescale, so I was wondering if anyone has considered pursuing a solution like this. Any assistance greatly appreciated.

user2112

1 Answer

6

To simplify the discussion below, I will first consider the case that all $\lambda_i > 0$, and then show how to deal with some unpenalized predictors.

Part 1: All predictors are penalized ($\lambda_i > 0$ for all $i$)

This case indeed works in exactly the way you described in your question.

Let $\Lambda = \text{Diag}(\lambda_1,\dotsc, \lambda_p)$ be the diagonal matrix where $\lambda_i$ is the penalty you want applied to the $i$-th predictor.

Then you can write the LASSO problem (design matrix $X$, response $Y$) as follows:

$$ \min_{\beta \in \mathbb R^p} || Y - X\beta||^2_2 + || \Lambda \beta ||_1 $$

Now note that multiplying $X$ by $\Lambda^{-1}$ from the right multiplies the $i$-th column by $1/\lambda_i$, and observe:

$$ ||Y - X\beta||_2^2 + || \Lambda \beta||_1 = ||Y - X\Lambda^{-1} \Lambda\beta||_2^2 + || \Lambda \beta||_1= ||Y - X\Lambda^{-1} \tilde{\beta}||_2^2 + || \tilde{\beta}||_1 $$

In the last step I defined $\tilde{\beta}= \Lambda \beta$. Hence the original LASSO problem must be equivalent to:

$$ \min_{\tilde{\beta} \in \mathbb R^p} || Y - X\Lambda^{-1} \tilde{\beta}||^2_2 + || \tilde{\beta} ||_1 $$

This is a LASSO problem in which every predictor gets a penalty of $1$. It is trivial to extend this so that every predictor gets a penalty of $\lambda$ (the entries of $\Lambda$ then represent the relative rather than the absolute penalization of the $i$-th predictor); you might want this if you want to nest the above within cross-validation-based tuning of the regularization parameter.

When you predict afterwards you need to remember what scaling you used though! E.g. if you use the original $X$, then use $\beta = \Lambda^{-1} \tilde{\beta}$!
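
To make Part 1 concrete, here is a minimal sketch in Python using scikit-learn (not the solver from the question; the weights `lam`, the global `alpha`, and the toy data are hypothetical choices, and scikit-learn's objective carries an extra $1/(2n)$ factor, which rescales the overall penalty but not the relative weighting):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# Relative penalty weights lambda_i (hypothetical choice): penalize the
# first predictor only 1/100 as much as the others.
lam = np.array([0.01, 1.0, 1.0, 1.0, 1.0])

# Rescale: the i-th column of X_tilde is X[:, i] / lambda_i,
# i.e. X_tilde = X @ Lambda^{-1}.
X_tilde = X / lam

# Solve an ordinary LASSO in beta_tilde = Lambda @ beta.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X_tilde, y)
beta_tilde = fit.coef_

# Undo the rescaling to recover coefficients on the original X.
beta_hat = beta_tilde / lam
print(beta_hat)
```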

Part 2: Some unpenalized predictors (i.e. some $\lambda_i = 0$)

Let's say you now want to solve a LASSO problem in which some predictors, let's call them $Z$, are not penalized, i.e.:

$$ \min_{\beta, \gamma} || Y - X\beta - Z\gamma||^2_2 + || \Lambda \beta ||_1 $$

(Here I just split the full design matrix into two parts $X$ and $Z$ corresponding to penalized or unpenalized predictors.)

If your LASSO solver does not support unpenalized predictors, then, as you mention in your comment, you could just use the technique from Part 1 with a $\lambda_i$ very close to $0$ for the unpenalized predictors. This would roughly work, but it is bad from a numerical perspective, since some columns of $X\Lambda^{-1}$ would blow up.

Instead, there is a better way to do this by orthogonalization. You could proceed in the following steps:

  1. Regress $Y \sim Z$, call the resulting coefficient $\tilde{\gamma}$ and let $\tilde{Y}$ be the residuals from this regression (i.e. $\tilde{Y} = Y - Z\tilde{\gamma}$).
  2. Regress $X \sim Z$: For each column of $X$, say the $i$-th column, run the regression $X_i \sim Z$. Then call $\tilde{X}$ the design matrix whose $i$-th column is the residual from the $i$-th regression.
  3. Run the following LASSO to get the fitted coefficient $\hat{\beta}$ (For this you will need the technique from Part 1.):

$$ \min_{\beta \in \mathbb R^p} || \tilde{Y} - \tilde{X}\beta||^2_2 + || \Lambda \beta ||_1 $$

  4. Finally, let $\hat{\gamma} = \tilde{\gamma} - (Z^TZ)^{-1}Z^TX\hat{\beta}$.

Then $(\hat{\beta}, \hat{\gamma})$ will be the solutions to the full LASSO problem.
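
Here is a minimal Python sketch of those four steps, under the same assumptions as the Part 1 sketch (scikit-learn solver, hypothetical `alpha`, weights and toy data), additionally assuming $Z$ has full column rank:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, q = 200, 4, 2
X = rng.standard_normal((n, p))            # penalized predictors
Z = np.column_stack([np.ones(n),           # unpenalized predictors
                     rng.standard_normal(n)])
y = (X @ np.array([2.0, 0.0, 0.0, -1.0])
     + Z @ np.array([0.5, 1.5])
     + rng.standard_normal(n))

lam = np.array([1.0, 1.0, 1.0, 0.1])       # relative penalties for columns of X

# Step 1: regress y on Z; keep the coefficients and the residuals.
gamma_tilde, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_res = y - Z @ gamma_tilde

# Step 2: regress each column of X on Z; keep the residual matrix.
coef_XZ, *_ = np.linalg.lstsq(Z, X, rcond=None)   # shape (q, p), = (Z'Z)^{-1} Z'X
X_res = X - Z @ coef_XZ

# Step 3: LASSO on the residuals, using the Part 1 rescaling trick.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X_res / lam, y_res)
beta_hat = fit.coef_ / lam

# Step 4: back out the coefficients of the unpenalized predictors.
gamma_hat = gamma_tilde - coef_XZ @ beta_hat

print(beta_hat, gamma_hat)
```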

Why does this work? This is a standard orthogonalization argument, similar to how the QR decomposition can be used to do linear regression. Essentially orthogonality (via the Pythagorean theorem -- I leave out the exact arguments as they are standard) allows us to split as follows (with $\hat{Y} = Y-\tilde{Y}$, $\hat{X} = X-\tilde{X}$):

$$ || Y-X\beta - Z\gamma||^2 = ||\tilde{Y} - \tilde{X}\beta||^2 + || \hat{Y} - \hat{X}\beta - Z\gamma||^2$$

So we want to solve:

$$ \min_{\beta, \gamma} \{ ||\tilde{Y} - \tilde{X}\beta||^2 + || \hat{Y} - \hat{X}\beta - Z\gamma||^2 + || \Lambda \beta ||_1 \}$$

Now if we optimize over $\gamma$ for fixed $\beta$, we obtain the expression from step 4 of the above procedure, and moreover the second squared term vanishes entirely, so the $\beta$ appearing in it drops out. What remains is only the LASSO from step 3. Putting everything together gives us the procedure outlined above.
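
To spell out that $\gamma$-step (using the fitted parts $\hat{Y} = Z\tilde{\gamma}$ and $\hat{X} = Z(Z^TZ)^{-1}Z^TX$ from steps 1 and 2):

$$ \arg\min_{\gamma} \; \| \hat{Y} - \hat{X}\beta - Z\gamma \|^2 = (Z^TZ)^{-1}Z^T\big(\hat{Y} - \hat{X}\beta\big) = \tilde{\gamma} - (Z^TZ)^{-1}Z^TX\beta, $$

and because $\hat{Y} - \hat{X}\beta$ lies in the column space of $Z$, this choice drives the second squared term to $0$ for every $\beta$. Minimizing what is left over $\beta$ is exactly the LASSO in step 3, and plugging $\hat{\beta}$ into the display above gives the formula for $\hat{\gamma}$ in step 4.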

air
  • Thanks for breaking it down, great answer! One thing I'm not sure about is whether there's a way to deal with this if some lambdas are zero. I suppose you could just use a really big 1/lambda as a weight, but am I correct in thinking that the weighting approach can't work for zero lambdas? – user2112 Feb 03 '18 at 20:03
  • @user2112 updated with the case of some lambdas equal to zero! – air Feb 12 '18 at 04:49