
$\newcommand{\diag}{\operatorname{diag}}$We have the problem: $$\min_{w\in\mathbb{R}^{d}}\left( \frac{1}{n}\sum_{i=1}^{n} \left( \langle w,x_{i}\rangle-y_{i} \right)^{2} +2\lambda||w||_1\right),$$ with the assumption that: $$\sum_{i=1}^nx_ix_i^T=\diag(\sigma_1^2,...,\sigma_d^2).$$

Is there a closed-form solution in this case?

I have that: $$(X^TX)^{-1}=\diag\left(\sigma_1^{-2},...,\sigma_d^{-2}\right),$$ and so I think the answer is: $$w^j=y^j\max\left\{0,\,1-\frac{\lambda n}{|y^j|}\right\},$$ for $y^j=\displaystyle\sum_{i=1}^n\frac{y_ix_i^j}{\sigma_j^2}$, but I'm not sure.

Arthur D.

1 Answer


I'm going to go through @cardinal's derivation of the closed-form lasso solution when $X^T X = I$, found here, with minor modifications.

I'll be assuming that $\sigma^2_i > 0$ for all $i$. This is justified because if some $\sigma^2_i = 0$, then the $i$th column of $X$ is identically $0$, and I think it's reasonable to exclude that case. I'll let $X^T X = D$. Note that this also means $X$ has full column rank, so the OLS solution $\hat \beta$ is uniquely defined.

I'm also going to modify your notation to better match that in the answer that I'm referencing. To that end, I shall be solving $$ \hat \beta_\lambda = \text{argmin}_{\beta \in \mathbb R^p } \frac 12 \vert \vert Y - X\beta\vert \vert^2_2 + \lambda \vert \vert \beta \vert \vert_1. $$

This is equivalent to your problem up to a rescaling of $\lambda$, as shown below; I can add more details here if you'd like.
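In particular, multiplying your objective by $\frac n2$ (which does not change the minimizer) gives $$ \frac n2 \left( \frac 1n \sum_{i=1}^{n} \left( \langle w,x_{i}\rangle-y_{i} \right)^{2} +2\lambda\vert\vert w\vert\vert_1\right) = \frac 12 \vert \vert Y - Xw \vert \vert^2_2 + n\lambda \vert \vert w \vert \vert_1, $$ so your $\lambda$ plays the role of $n\lambda$ in my notation, which is presumably where the factor of $n$ in your proposed solution comes from.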

Following @cardinal's derivation, we need to solve $$ \hat \beta_\lambda = \text{argmin } \frac 12 (Y^T Y - 2 Y^T X \beta + \beta^T X^T X \beta) + \lambda \vert \vert \beta \vert \vert_1 $$

$$ = \text{argmin } -Y^T X \beta + \frac 12 \beta^T D \beta + \lambda \vert \vert \beta \vert \vert_1. $$

Noting that the OLS solution is $\hat \beta = (X^T X)^{-1} X^T Y = D^{-1}X^T Y$, we have that $$ \hat \beta_\lambda = \text{argmin } -\hat \beta^T D \beta + \frac 12 \beta^T D \beta + \lambda \vert \vert \beta \vert \vert_1 $$

$$ = \text{argmin } \sum_{j=1}^p -\hat \beta_j \beta_j \sigma^2_j + \frac{\sigma^2_j}2 \beta_j^2 + \lambda | \beta_j|. $$

We are optimizing over each $\beta_j$ separately, so we can solve each term of this sum separately. This means we need to minimize $\mathcal L_j$ where $$ \mathcal L_j = -\hat \beta_j \beta_j \sigma^2_j + \frac{\sigma^2_j}2 \beta_j^2 + \lambda | \beta_j|. $$

Following a completely analogous argument to the linked answer, we find that $$ (\hat \beta_\lambda)_j = \mathrm{sgn}(\hat \beta_j) \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+. $$
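For completeness, here is a sketch of how the positive part arises, in the case $\hat \beta_j > 0$ (the other cases are analogous). First, the minimizer of $\mathcal L_j$ must satisfy $\beta_j \geq 0$: flipping the sign of a negative candidate $\beta_j$ leaves the quadratic term and the penalty unchanged but strictly decreases $-\hat \beta_j \beta_j \sigma^2_j$. On $\beta_j \geq 0$ we have $$ \mathcal L_j(\beta_j) = \frac{\sigma^2_j}{2} \beta_j^2 - \left(\hat \beta_j \sigma^2_j - \lambda\right) \beta_j, $$ a convex quadratic whose unconstrained minimizer is $\hat \beta_j - \frac{\lambda}{\sigma^2_j}$. If this quantity is non-negative it is also the constrained minimizer; otherwise the quadratic is increasing on $[0, \infty)$ and the best we can do is $\beta_j = 0$. In both cases $(\hat \beta_\lambda)_j = \left(\hat \beta_j - \frac{\lambda}{\sigma^2_j}\right)^+$, which is the $\hat \beta_j > 0$ case of the formula above.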

Furthermore, $\hat \beta = D^{-1} X^T Y \implies \hat \beta_j = \frac{X_j^T Y}{\sigma_j^2}$ so we have that $$ \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+ = \frac 1 {\sigma^2_j} \left(|X_j^T Y| - \lambda\right)^+ $$
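Since $\sigma^2_j > 0$, we also have $\mathrm{sgn}(\hat \beta_j) = \mathrm{sgn}(X_j^T Y)$, so the whole solution can be written directly in terms of $X_j^T Y$: $$ (\hat \beta_\lambda)_j = \frac{\mathrm{sgn}(X_j^T Y)}{\sigma^2_j} \left(|X_j^T Y| - \lambda\right)^+. $$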

In particular, a predictor $X_j$ gets zeroed out exactly when it would if the design matrix were orthonormal, not just orthogonal. So in this case with $X^T X = D \neq I$, the variable selection is no different than with $X^T X = I$, but the magnitudes of the nonzero coefficients in $\hat \beta_\lambda$ are scaled according to the predictor variances $\sigma^2_j$.

As a final note, I will turn this solution into one that resembles yours, which means we need to multiply each $\hat \beta_j$ by something to get $(\hat \beta_\lambda)_j$. If $(\hat \beta_\lambda)_j \neq 0$ then we have that $$ (\hat \beta_\lambda)_j = \text{sgn}(\hat \beta_j) \left( \vert \hat \beta_j \vert - \frac{\lambda}{\sigma^2_j} \right) = \hat \beta_j - \text{sgn}(\hat \beta_j) \frac{\lambda}{\sigma^2_j} $$

$$ = \hat \beta_j \left( 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \right) $$

since $\frac{a}{\vert a \vert} = \text{sgn}(a)$.

Noting that $(\hat \beta_\lambda)_j = 0$ exactly when $$ \vert \hat \beta_j \vert - \frac{\lambda}{\sigma^2_j} \leq 0 \iff \vert \hat \beta_j \vert \leq \frac{\lambda}{\sigma^2_j} \iff 1 \leq \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \iff 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \leq 0, $$

we see that we could alternatively express $\hat \beta_\lambda$ as $$ (\hat \beta_\lambda)_j = \hat \beta_j \left( 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \right)^+. $$

So this is very close to what you had but not exactly the same.

I always like to check derivations like this against well-known libraries if possible, so here's an example in R:

## generating `x` with orthogonal columns so that t(x) %*% x = diag(sigma2s)
set.seed(1)
n = 1000
p = 5
sigma2s = 1:p
## the columns of svd(...)$u are orthonormal, so scaling column j by sqrt(sigma2s[j]) gives the desired Gram matrix
x = svd(matrix(rnorm(n * p), n, p))$u %*% diag(sqrt(sigma2s))

## sanity check: should be (numerically) diag(sigma2s)
# t(x) %*% x

## generating `y`
betas = 1:p
y = x %*% betas + rnorm(nrow(x), 0, .5)

lambda = 2

## using a well-known library to fit lasso
## (the @penalized slot holds the fitted penalized coefficients)
library(penalized)
penalized(y, x, lambda1 = lambda)@penalized


## using closed form solution
betahat = lm(y ~ x - 1)$coef  # OLS fit without intercept: betahat_j = X_j'Y / sigma_j^2
sign(betahat) * pmax(abs(betahat) - lambda / sigma2s, 0)  # soft-threshold coordinate j at lambda / sigma_j^2
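
And as one last check that the two equivalent expressions from the end of the derivation agree, the multiplicative form (reusing `betahat`, `lambda` and `sigma2s` from above) should give the same output as the line above:

## same solution via the multiplicative form: betahat_j * (1 - lambda / (sigma2s_j * |betahat_j|))^+
betahat * pmax(1 - lambda / (sigma2s * abs(betahat)), 0)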
jld
  • Can you please elaborate on why $(\hat \beta_\lambda)_j = \mathrm{sgn}(\hat \beta_j) \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+$? I don't understand the $+$ in $\left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+$. I can't see why this is true (even in the linked answer). – Dennis Jun 06 '21 at 20:37
  • @Dennis I'll focus on the linked answer. Does it make sense why $\text{sgn}(\hat\beta_j) = \text{sgn}(\beta_j)$? Once we know they have the same signs, then we can go case by case. The first case is $\hat\beta_j > 0$ and in this case we know $\beta_j \geq 0$. We can then show that $\beta_j = \hat\beta_j - \gamma$, but since we also know that $\beta_j$ is non-negative, if $\hat\beta_j - \gamma < 0$ then this isn't feasible so we get the closest we can to it, which is just zero. $(\hat\beta_j - \gamma)^+$ is a shorthand for this. Then we do the same for the negative case – jld Jun 08 '21 at 13:58
  • Thank you for the help. The only part that confuses me is *"so we get the closest we can to it"* that you said. Are you sure this is the right explanation? It lacks some formality as far as I understand. Could you elaborate on that part, please? – Dennis Jun 13 '21 at 10:22