Let $H_2$ denote the orthogonal projector onto the column space of $X_2$. We have that
\begin{align*}
& \min_{\beta_1, \beta_2} \left\{ \|y - X_1\beta_1 - X_2\beta_2\|_2^2 + \lambda \|\beta_1\|_1 \right\} \\
= & \, \min_{\beta_1, \beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
= & \, \min_{\beta_1} \min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\},
\end{align*}
where the first equality is the Pythagorean decomposition of the squared error into its component in $\mathrm{col}(X_2)$ and its component in the orthogonal complement (using that $X_2\beta_2 \in \mathrm{col}(X_2)$), and where the inner minimizer
\begin{align*}
\hat\beta_2
& = \arg\min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
& = \arg\min_{\beta_2} \left\{ \|H_2\left(y - X_1\beta_1 \right) - X_2 \beta_2\|_2^2 \right\}
\end{align*}
satisfies $X_2 \hat\beta_2 = H_2 (y - X_1 \beta_1)$ for all $\beta_1$, since $H_2 (y - X_1 \beta_1) \in \mathrm{col}(X_2)$ for all $\beta_1$. If, moreover, $X_2$ has full column rank, then $H_2 = X_2 (X_2^T X_2)^{-1} X_2^T$, and we further have that $$\hat\beta_2 = (X_2^T X_2)^{-1} X_2^T (y - X_1 \beta_1).$$
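A quick numerical sanity check of this step (a minimal NumPy sketch; the dimensions, the random seed, and the full-column-rank $X_2$ are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 50, 5, 3
X1, X2 = rng.normal(size=(n, p1)), rng.normal(size=(n, p2))
y = rng.normal(size=n)
beta1 = rng.normal(size=p1)          # any fixed beta_1

# Orthogonal projector onto col(X2) (X2 assumed to have full column rank)
H2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T

r = y - X1 @ beta1                   # partial residual
beta2_hat = np.linalg.solve(X2.T @ X2, X2.T @ r)

# The fitted values equal the projection of the partial residual onto col(X2)
assert np.allclose(X2 @ beta2_hat, H2 @ r)

# Pythagorean split of the squared error used in the first equality above,
# checked for an arbitrary beta_2
beta2 = rng.normal(size=p2)
lhs = np.sum((y - X1 @ beta1 - X2 @ beta2) ** 2)
rhs = (np.sum((H2 @ r - X2 @ beta2) ** 2)
       + np.sum(((np.eye(n) - H2) @ r) ** 2))
assert np.allclose(lhs, rhs)
```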
Plugging this back into the outer minimization over $\beta_1$, we see that
\begin{align*}
\hat\beta_1
& = \arg\min_{\beta_1} \left\{ 0 + \|\left(I-H_2\right)\left(y - X_1\beta_1 \right) \|_2^2 + \lambda \|\beta_1 \|_1 \right\} \\
& =\arg\min_{\beta_1} \left\{ \|\left(I-H_2\right)y - \left(I-H_2\right)X_1\beta_1 \|_2^2 + \lambda \|\beta_1 \|_1 \right\}, \tag{*}
\end{align*}
which can be solved with the usual lasso computational tools. As whuber suggests in his comment, this result is intuitive: the unpenalized coefficients $\beta_2$ can absorb anything in the span of $X_2$, so only the part of the space orthogonal to $\mathrm{col}(X_2)$ matters when evaluating $\hat\beta_1$.
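For concreteness, here is a minimal sketch of the two-step procedure, assuming scikit-learn's `Lasso` as the off-the-shelf solver; note that its objective scales the squared error by $1/(2n)$, so the penalty parameter has to be rescaled to match the $\lambda$ above. The data sizes and seed are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p1, p2, lam = 100, 10, 4, 5.0
X1, X2 = rng.normal(size=(n, p1)), rng.normal(size=(n, p2))
y = X1[:, :3].sum(axis=1) + X2 @ rng.normal(size=p2) + rng.normal(size=n)

# Project out col(X2)
H2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
M2 = np.eye(n) - H2
y_t, X1_t = M2 @ y, M2 @ X1

# Solve (*) with a standard lasso solver.
# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
# so alpha = lam / (2n) matches the lambda in the derivation above.
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X1_t, y_t)
beta1_hat = fit.coef_

# Recover the unpenalized block from its closed form
beta2_hat = np.linalg.solve(X2.T @ X2, X2.T @ (y - X1 @ beta1_hat))
```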
Although the notation here is slightly more general, nearly anyone who has ever used the lasso is familiar with this result. To see this, suppose that $X_2 = \mathbf{1}$ is the (length-$n$) vector of ones, representing the intercept. Then the projection matrix is $H_2 = \mathbf{1} \left( \mathbf{1}^T \mathbf{1} \right)^{-1} \mathbf{1}^T = \frac{1}{n} \mathbf{1} \mathbf{1}^T$, and, for any vector $v$, the orthogonal projection $\left( I - H_2 \right) v = v - \bar{v} \mathbf{1}$ simply demeans the vector. Considering equation $(*)$, this is exactly what people do when they compute the lasso coefficients! They demean the data so that the intercept doesn't have to be considered.
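A minimal sketch of that equivalence, again assuming scikit-learn's `Lasso` (with `fit_intercept=True` it centers the data internally, so the two fits should agree up to solver tolerance); the data and seed are again arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3.0 + X[:, :2].sum(axis=1) + rng.normal(size=n)

# Fit with an (unpenalized) intercept ...
with_int = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)

# ... versus demeaning y and the columns of X and fitting without one.
Xc, yc = X - X.mean(axis=0), y - y.mean()
no_int = Lasso(alpha=0.1, fit_intercept=False).fit(Xc, yc)

print(np.allclose(with_int.coef_, no_int.coef_))   # True: same slope estimates
# The intercept is then recovered as ybar - xbar' * beta_hat
print(np.isclose(with_int.intercept_, y.mean() - X.mean(axis=0) @ no_int.coef_))
```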