Using `glmnet`, we run the following ridge regression ($\alpha = 0$):

    cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE)

where $y$ is the response variable and $x$ is the input matrix. The input matrix is a partitioned matrix of the form $x=[H,D]$ where
- $H$ is an $n\times p_{1}$ matrix of tests results and
- $D$ is an $n\times p_{2}$ matrix of patient-specific dummies. That is, we allow each patient to have a different $y$-intercept.
We specify `intercept = FALSE` because the patient dummies already play the role of intercepts and we are interested in their point estimates (with a full set of dummies, a common intercept would be redundant). The columns of $H$ are highly correlated, reflecting the fact that the tests often capture similar characteristics. As a consequence, the point estimates come with large standard errors, which is why we regularize. We have $n=1000$, $p_{1}=50$, $p_{2}=70$.
Our question: Should we regularize the dummy variables?
As I see it (I am happy to be corrected), there are arguments for and against regularizing the dummies.
- We should not regularize the dummies because the parameter instability is "caused" by $H$ and not $D$ (indeed, $D^{T}D$ is diagonal, so the dummy columns are mutually orthogonal), and we should concentrate on regularizing $H$. By regularizing both $H$ and $D$, we may not regularize $H$ enough.
- We should regularize the dummies as well because the presence of $D$ may already help to alleviate the problems caused by the multicollinearity of $H$. By regularizing both $H$ and $D$, we may not need to regularize as much as if we only regularized $H$.
The table below shows that, contrary to my expectation, the cross-validated $\lambda$ is larger when we regularize the dummies as well.
| Regression | $\lambda$ | $\log(\lambda)$ | MSE   |
|------------|-----------|-----------------|-------|
| 1          | 0.161     | -1.829          | 0.910 |
| 2          | 0.034     | -3.396          | 0.869 |
where the code for the two regressions is

    Regression 1: cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE)
    Regression 2: cvfit = cv.glmnet(x, y, alpha = 0, intercept = FALSE, penalty.factor = p.fac)
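The post does not show how `p.fac` is defined; presumably it removes the penalty from the dummy block, along the lines of (my assumption, not taken from the question):

```r
# Assumed definition of p.fac: penalty factor 1 for the p1 test columns,
# 0 for the p2 dummy columns, so only H is shrunk in Regression 2
p1 <- 50; p2 <- 70
p.fac <- c(rep(1, p1), rep(0, p2))
```

Note that `glmnet` internally rescales `penalty.factor` to sum to the number of variables, so the effective meaning of $\lambda$ differs between the two fits; the two $\lambda$ values in the table are therefore not directly comparable on the same scale.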