Using `glmnet`, we run the following ridge regression ($\alpha = 0$):

    cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE)

where $y$ is the response variable and $x$ is the input matrix. The input matrix is a partitioned matrix of the form $x=[H,D]$ where
- $H$ is an $n\times p_{1}$ matrix of tests results and
- $D$ is an $n\times p_{2}$ matrix of patient-specific dummies. That is, we allow each patient to have a different $y$-intercept.
We specify `intercept = FALSE` because the patient dummies already play the role of intercepts and we are interested in their point estimates (with a full set of dummies, a common intercept would be redundant). The columns of $H$ are highly correlated, reflecting the fact that the tests often capture similar characteristics. As a consequence, the point estimates come with large standard errors, which is why we regularize. We have $n=1000$, $p_{1}=50$, $p_{2}=70$.
Our question: Should we regularize the dummy variables?
As I see it (I am happy to be corrected), there are arguments for and against regularizing the dummies.
- We should not regularize the dummies because the parameter instability is "caused" by $H$ and not $D$ (indeed, $D^{T}D$ is diagonal, so the dummy columns are mutually orthogonal), and we should concentrate on regularizing $H$. By regularizing both $H$ and $D$, we may not regularize $H$ enough.
- We should regularize the dummies as well because the presence of $D$ may already help to alleviate the problems caused by the multicollinearity of $H$. By regularizing both $H$ and $D$, we may not need to regularize as much as if we only regularized $H$.
The table below shows that, contrary to my expectation, the cross-validated $\lambda$ is larger when we regularize the dummies as well.
| Regression | $\lambda$ | $\log(\lambda)$ | MSE   |
|------------|-----------|-----------------|-------|
| 1          | 0.161     | -1.829          | 0.910 |
| 2          | 0.034     | -3.396          | 0.869 |
where the code for the two regressions is

    Regression 1: cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE)
    Regression 2: cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE, penalty.factor = p.fac)
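The post does not show how `p.fac` is defined; presumably it removes the penalty from the dummy block, along the lines of (my assumption, not taken from the question):

```r
# Assumed definition of p.fac: penalty factor 1 for the p1 test columns,
# 0 for the p2 dummy columns, so only H is shrunk in Regression 2
p1 <- 50; p2 <- 70
p.fac <- c(rep(1, p1), rep(0, p2))
```

Note that `glmnet` internally rescales `penalty.factor` to sum to the number of variables, so the effective meaning of $\lambda$ differs between the two fits; the two $\lambda$ values in the table are therefore not directly comparable on the same scale.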