
I want to do the following:

1) OLS regression (no penalty term) to get the beta coefficients $b_{j}^{*}$; $j$ indexes the regressors. I do this with

lm.model = lm(y ~ 0 + x)           # OLS fit without an intercept
betas    = coefficients(lm.model)  # initial estimates b_j^*

2) Lasso regression with a penalty term; the selection criterion shall be the Bayesian Information Criterion (BIC), which gives the penalties

$$\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$$

where $j$ stands for the variable/regressor number, $T$ for the number of observations, and $b_{j}^{*}$ for the initial betas obtained in step 1). I want to have regression results for this specific $\lambda_j$ value, which is different for each regressor used. Hence if there are three variables, there will be three different values $\lambda_j$.
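
For concreteness, these per-regressor values could be computed along these lines (a minimal sketch, assuming `y` and the `betas` from step 1 are available; `T_obs` and `pnlty` are just illustrative names):

T_obs = length(y)                          # number of observations T
pnlty = log(T_obs) / (T_obs * abs(betas))  # one lambda_j per regressor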

The OLS-Lasso optimization problem is then given by

$$\underset{b\in \mathbb{R}^{n} }{\min} \left \{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} ( \lambda_{j}|b_{j}| )\right \}$$

How can I do this in R with either the lars or glmnet package? I cannot find a way to specify the lambda values, and I am not 100% sure that I get the correct results if I run

lars.model <- lars(x,y,type = "lasso", intercept = FALSE)
predict.lars(lars.model, type="coefficients", mode="lambda")

I appreciate any help here.


Update:

I have used the following code now:

fits.cv = cv.glmnet(x, y, type.measure = "mse", penalty.factor = pnlty)
lmin    = fits.cv$lambda.min   # lambda giving minimum mean cross-validated error
fits    = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = pnlty)
coefs   = coef(fits, s = lmin)

In line 1 I use cross-validation with my specified penalty factors ($\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$), which differ across regressors. Line 2 selects "lambda.min" from fits.cv, the lambda that gives the minimum mean cross-validated error. Line 3 performs a lasso fit (alpha=1) on the data, again using the penalty factors $\lambda_j$. Line 4 extracts from fits the coefficients that belong to the "optimal" $\lambda$ chosen in line 2.

Now I have the beta coefficients for the regressors, which constitute the optimal solution of the minimization problem

$$\underset{b\in \mathbb{R}^{n} }{\min} \left \{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} ( \lambda_{j}|b_{j}| )\right \}$$

with a penalty factor $\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$. The optimal set of coefficients is most likely a subset of the regressors I initially used; this is a consequence of the lasso method, which shrinks the number of regressors used.
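
To see which regressors survive the shrinkage at the chosen lambda, one can inspect the nonzero entries of the coefficient vector, e.g. (a sketch, assuming the `coefs` object from the snippet above):

nonzero = which(as.vector(coefs) != 0)   # positions of nonzero coefficients
rownames(coefs)[nonzero]                 # names of the surviving regressors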

Is my understanding and the code correct?

Dom
  • You can use LaTeX markup in your post, enclosed in dollar signs. `$\alpha$` becomes $\alpha$. Please do this, as it will make it easier for people to understand your question, and therefore to answer it. – Sycorax Feb 02 '15 at 16:44

1 Answer


From the glmnet documentation (?glmnet), we see that it is possible to perform differential shrinkage. This gets us at least part-way to answering OP's question.

penalty.factor: Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.

To fully answer the question, though, I think that there are two approaches available to you, depending on what you want to accomplish.

  1. Your question is how to apply differential shrinkage in glmnet and retrieve the coefficients for a specific value $\lambda$. Supplying penalty.factor s.t. some values are not 1 achieves differential shrinkage at any value of $\lambda$. To achieve shrinkage s.t. the shrinkage for each $b_j$ is $\phi_j= \frac{\log T}{T|b_j^*|}$, we just have to do some algebra. Let $\phi_j$ be the penalty factor for $b_j$, the value that would be supplied to penalty.factor. From the documentation, we can see that these values are re-scaled to $\phi^\prime_j=C\phi_j$, where $C$ is chosen so that the factors sum to the number of variables, i.e. $m=C\sum_{j=1}^m \frac{\log T}{T|b^*_j|}$. This means that $\phi_j^\prime$ replaces $\phi_j$ in the optimization expression below. So solve for $C$, supply the values $\phi_j^\prime$ to glmnet, and then extract coefficients for $\lambda=1$; I would recommend using coef(model, s=1, exact=T). (A sketch of these steps follows this list.)

  2. The second is the "standard" way to use glmnet: one performs repeated $k$-fold cross-validation to select the $\lambda$ that minimizes out-of-sample MSE. This is what I describe below in more detail. The reason we use CV and check out-of-sample MSE is that in-sample MSE is always minimized at $\lambda=0$, i.e. when $b$ is the ordinary MLE. Using CV while varying $\lambda$ allows us to estimate how the model performs on out-of-sample data, and to select a $\lambda$ that is optimal (in a specific sense).
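
For approach 1, a minimal sketch of the mechanics might look like the following (assuming `x`, `y`, and the OLS `betas` from the question; `T_obs`, `phi`, `C`, and `phi_resc` are illustrative names, and the choice of `s = 1` follows the algebra above rather than being a general rule):

library(glmnet)

T_obs = nrow(x)                            # number of observations T
m     = ncol(x)                            # number of regressors
phi   = log(T_obs) / (T_obs * abs(betas))  # desired per-coefficient penalties phi_j

C        = m / sum(phi)                    # rescaling constant so the factors sum to nvars
phi_resc = C * phi                         # phi'_j, the values glmnet works with internally

fit = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = phi_resc)

# coefficients at lambda = 1; with recent glmnet versions, exact = TRUE
# may require re-supplying the original data and penalty factors
b1 = coef(fit, s = 1, exact = TRUE, x = x, y = y, penalty.factor = phi_resc)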

That glmnet call doesn't specify a $\lambda$ (nor should it, because it computes the entire $\lambda$ trajectory by default for performance reasons). coef(fits,s=something) will return the coefficients for the $\lambda$ value something. But no matter the choice of $\lambda$ you provide, the result will reflect the differential penalty that you applied in the call to fit the model.

The standard way to select an optimal value of $\lambda$ is to use cv.glmnet, rather than glmnet. Cross-validation is used to select the amount of shrinkage which minimizes out-of-sample error, while the specification of penalty.factor will shrink some features more than others, according to your weighting scheme.
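
A minimal sketch of that workflow, assuming `x`, `y`, and a penalty vector `phi` as above (this mirrors the code in the question's update):

library(glmnet)

# cross-validate over lambda while keeping the per-regressor penalty factors fixed
cv.fit = cv.glmnet(x, y, type.measure = "mse", alpha = 1,
                   intercept = FALSE, penalty.factor = phi)

# coefficients at the lambda minimizing mean cross-validated error
b.cv = coef(cv.fit, s = "lambda.min")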

This procedure optimizes

$$ \underset{b\in \mathbb{R}^{m} }{\min} \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + \lambda \sum_{j=1}^{m} ( \phi_{j}|b_{j}| ) $$

where $\phi_j$ is the penalty factor for the $j^{th}$ feature (what you supply in the penalty.factor argument). (This is slightly different from your optimization expression; note that some of the subscripts are different.) Note that the $\lambda$ term is the same across all features, so the only way that some features are shrunk more than others is through $\phi_j$. Importantly, $\lambda$ and $\phi$ are not the same; $\lambda$ is scalar and $\phi$ is a vector! In this expression, $\lambda$ is fixed/assumed known; that is, optimization will pick the optimal $b$, not the optimal $\lambda$.

This is basically the motivation of glmnet as I understand it: to use penalized regression to estimate a regression model that is not overly-optimistic about its out-of-sample performance. If this is your goal, perhaps this is the right method for you after all.

Sycorax
  • +1 This is correct. I will also add that the regularization of regression can be viewed as a Bayesian prior, i.e. maximum a posteriori (MAP) estimation is regularized maximum likelihood (ML). Working in that framework gives oneself more flexibility in regularization should it be needed. – TLJ Feb 02 '15 at 19:45
  • If I run `pnlty = log(24)/(24*betas); fits = glmnet(x,y, alpha=1, intercept=FALSE, penalty.factor = pnlty)` how do I then extract the regressor betas which correspond to the lambda that I specified, as the lambda is different for every risk factor? – Dom Feb 04 '15 at 08:25
  • For what I want to do, can't I do this: `fits = glmnet(x,y, alpha=1, intercept=FALSE, penalty.factor = pnlty)` (pnlty is a vector specifying the penalty for each regressor) and select the coefficients for lambda.min, which I get from `coef(fits.cv, s = "lambda.min")`? Hence: `coef(fits, s=xxx)`? xxx being lambda.min – Dom Feb 04 '15 at 15:46
  • @user777: I updated my original post. If you could please have a look if my understanding and procedure is now correct. – Dom Feb 05 '15 at 10:33
  • I understand the difference between my initial optimization problem and the solution I suggested in my update. Is there a way to exactly solve the initial problem with the glmnet package? From my understanding, the glmnet equation for lasso doesn't incorporate the $T$ in front of the second term of the optimization problem. Besides, should the minimization not produce different beta values and not minimize the MSE w.r.t. $\lambda$? – Dom Feb 06 '15 at 09:43
  • I meant that lasso tries to minimize the optimization function (see my original post) with respect to the betas. It will deliver a combination of betas for which the function is minimal. That's implied by $\underset{b\in \mathbb{R}^{n} }{\min}$. But glmnet minimizes MSE with respect to $\lambda$. This is more of a general understanding issue on my side. I do not fully understand why lambda is varied to minimize MSE, although the $\beta$ coefficients are supposed to be varied to reach the minimum. Maybe I am comparing apples and oranges here. – Dom Feb 06 '15 at 14:34
  • @Dom It dawned on me a bit too late that there's an obvious way to get exactly what you want using `glmnet`. See my revised answer. – Sycorax Feb 06 '15 at 18:21
  • Beware of customizing the penalty separately for each predictor. That would amount to nothing more than stepwise variable selection in some cases. Penalized regression decreases mean squared error by assuming a very limited number of penalty parameters and borrowing information across predictors. – Frank Harrell Feb 07 '15 at 00:30
  • @FrankHarrell Thanks for the comment! It seems that using different penalties to each predictor amounts to a Bayesian model which assumes a different prior for each parameter. That doesn't strike me as posing a unique hazard over Bayesian inference generally. Also, could you elaborate on how penalized regression borrows information across predictors? I'm not sure I fully grasp how that's the case in such a scenario. – Sycorax Feb 07 '15 at 02:08
  • For example using a single $\lambda$ with a quadratic penalty function you are simultaneously shrinking all the coefficients. Allowing penalties to "float" without pre-specifying the relative penalties is analogous to doing Bayesian modeling where you update the priors after peeking at the data. – Frank Harrell Feb 07 '15 at 03:47
  • Now this update has made me understand my issue. Well done! – Dom Feb 09 '15 at 13:14
  • @Dom Glad to help! Sorry that it took so long for me to get it there. This is not a use of `glmnet` that I'd encountered before. – Sycorax Feb 09 '15 at 15:10