
Given the objective function $$\text{argmin}_{c}\left(\tfrac{1}{2}\Vert y - Xc \Vert_2^2 + \lambda\Vert c \Vert_1\right)$$

by taking the subgradient $$\frac{\partial}{\partial c}\left(\tfrac{1}{2}\Vert y - Xc \Vert_2^2 + \lambda\Vert c \Vert_1\right),$$

it can be proved that $\lambda_{\min} = \max_j|y^TX_j|$ is the smallest $\lambda$ that makes the solution of the objective the zero vector.
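(A sketch of the argument via subgradient optimality at $c = 0$: the zero vector minimizes the objective iff $0 \in -X^Ty + \lambda\,\partial\Vert c\Vert_1\big|_{c=0}$, i.e. $|y^TX_j| \le \lambda$ for every column $j$, which holds exactly when $\lambda \ge \max_j|y^TX_j|$.)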

However, when I tried this in practice, I found that this $\lambda_{\min}$ is not the smallest such value. Here is my code:

import numpy as np
from sklearn.linear_model import Lasso

x_1 = np.linspace(-1, 1, 30)
x_2 = 2 * x_1 + 0.01 * np.random.rand(30)  # nearly collinear with x_1
x_3 = np.random.uniform(-1, 1, 30)
y = 0.1 * x_1 + 0.5 * x_2

# candidate threshold: lambda_min = max_j |<x_j, y>|
l1 = abs(np.inner(x_1, y))
l2 = abs(np.inner(x_2, y))
l3 = abs(np.inner(x_3, y))
lambda_min = max(l1, l2, l3)

# stack the predictors into a 30 x 3 design matrix
x = np.ones((30, 3))
x[:, 0], x[:, 1], x[:, 2] = x_1, x_2, x_3
lambda_n = lambda_min

# shrink lambda until the lasso returns a nonzero coefficient vector
while True:
    lambda_n = lambda_n * 0.95
    lasso_test = Lasso(alpha=lambda_n, fit_intercept=False)
    lasso_test.fit(x, y)
    temp = lasso_test.coef_
    if np.any(temp != 0):  # elementwise test; np.sum(temp) could cancel out
        print('lambda_n = ', lambda_n)
        break
print('lambda_min = ', lambda_min)

The printed lambda_min is far greater than lambda_n, which (up to the 0.95 step size) is the smallest lambda that makes lasso_test.coef_ a zero vector.

Only when I change

lambda_min = max(l1, l2, l3) / 30  # number of examples

do lambda_min and lambda_n agree.

Given this, the formula for calculating $\lambda_{\min}$ should be $$\lambda_{\min} = \frac{1}{N}\max_j|y^TX_j|.$$

I believe there is nothing wrong with the theory, so what is wrong with my code?

meTchaikovsky
  • There is nothing wrong with your code. The objective function that scikit-learn's Lasso minimizes is not $\frac{1}{2}||y-Xc||_2^2$ but $\frac{1}{2n}||y-Xc||_2^2$, hence $\lambda_{min}= \frac{1}{n} \max_j|X_j^Ty|$. – chRrr Sep 11 '18 at 11:30
  • @chRrr Thank you! I just overlooked the objective function in the documentation... – meTchaikovsky Sep 11 '18 at 11:32

1 Answer


Turning my comment into an answer:

There is nothing wrong with your code. The additional $\frac{1}{n}$ factor in $\lambda_{min}$ that you observed in Python using the sklearn Lasso function is an immediate consequence of the objective function scikit-learn uses.

The Lasso class from scikit-learn minimizes $$\frac{1}{2n}||y-Xc||_2^2 + \lambda||c||_1$$ instead of $$\frac{1}{2}||y-Xc||_2^2 + \lambda||c||_1.$$
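Dividing the second objective by $n$ makes the correspondence explicit: a penalty of $\lambda$ in the textbook formulation behaves like a penalty of $\lambda/n$ on scikit-learn's scale, since $$\frac{1}{n}\left(\frac{1}{2}||y-Xc||_2^2 + \lambda||c||_1\right) = \frac{1}{2n}||y-Xc||_2^2 + \frac{\lambda}{n}||c||_1.$$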

As a consequence, the smallest value of $\lambda$ at which the Lasso will result in all-$0$ coefficients is given by $$\lambda_{min}=\frac{1}{n}\max_j|X_j^Ty|.$$

(Note: Arguments leading to the value of $\lambda_{min}$ can for example be found here: Lasso and Ridge tuning parameter scope)
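For a quick numerical check of this threshold, here is a minimal sketch (the random data and the $\pm 0.1\%$ margins around the threshold are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
n = X.shape[0]

# the claimed threshold, on scikit-learn's alpha scale
lam = np.max(np.abs(X.T @ y)) / n

# just above the threshold: every coefficient is exactly zero
print(Lasso(alpha=1.001 * lam, fit_intercept=False).fit(X, y).coef_)
# just below the threshold: at least one coefficient is nonzero
print(Lasso(alpha=0.999 * lam, fit_intercept=False).fit(X, y).coef_)

(With fit_intercept=True, scikit-learn centers the data first, so the same formula would apply to the centered $X$ and $y$.)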

chRrr
  • In many practical scenarios, this value of $\lambda$ does not give you the optimal value. So, one standard practice that I have come across is the 'three-sigma rule', where you compute the standard deviation of $\frac{1}{n} X^T(y - Xc)$ and set $\lambda = 3\,\mathrm{std}(\frac{1}{n} X^T(y - Xc))$. – Maxtron Sep 11 '18 at 15:38
  • @Maxtron Thank you for sharing this! I will try this in my code! – meTchaikovsky Sep 11 '18 at 23:52
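For reference, a minimal sketch of the three-sigma heuristic from the last comment; the function name and the assumption that c is some current coefficient estimate (e.g. from an unpenalized fit) are illustrative, not part of the original comment:

import numpy as np

def three_sigma_lambda(X, y, c):
    # heuristic: lambda = 3 * std of the correlation vector (1/n) X^T (y - Xc)
    n = X.shape[0]
    r = X.T @ (y - X @ c) / n
    return 3.0 * np.std(r)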