
In Ridge regression we estimate coefficients as

$\hat{\beta}_\lambda = \arg \min_\beta \|y - X\beta \|_2^2 + \lambda\|\beta\|_2^2 \qquad \qquad(1)$

for a given $\lambda$.
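
For concreteness, here is a minimal numpy sketch of (1) using the standard closed form $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1}X^\top y$; the data and the value of $\lambda$ are made up purely for illustration.

```python
# Minimal sketch of the ridge estimator in (1); X, y, lam are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam = 2.0

# Closed-form minimizer of ||y - X b||_2^2 + lam * ||b||_2^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge, np.linalg.norm(beta_ridge))
```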

If I wanted the coefficients to lie in the ball with radius $R$, I would use quadratically constrained least squares,

minimize $\|y - X\beta \|_2^2 $

subject to $ \|\beta\|_2^2 \leq R^2, \qquad \qquad (2)$

and, e.g., via Lagrange multipliers, minimize $\|y - X\beta \|_2^2 + \nu\left(\|\beta\|_2^2 - R^2\right)$ with respect to $\nu$ and $\beta$ to get the pair $(\tilde{\nu}, \tilde{\beta})$ at which the minimum is reached.
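A rough numerical sketch of (2), not necessarily how one would do it in practice: when the constraint binds, one can search for the multiplier $\nu$ at which $\|\beta(\nu)\|_2 = R$, where $\beta(\nu) = (X^\top X + \nu I)^{-1}X^\top y$ minimizes the Lagrangian over $\beta$. The data and the value of $R$ below are placeholders.

```python
# Sketch: solve (2) for fixed R by bisecting on the multiplier nu so that
# ||beta(nu)||_2 = R, where beta(nu) minimizes the Lagrangian over beta.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
R = 1.0

def beta_of(nu):
    return np.linalg.solve(X.T @ X + nu * np.eye(X.shape[1]), X.T @ y)

beta_ols = beta_of(0.0)
if np.linalg.norm(beta_ols) <= R:            # constraint does not bind
    beta_tilde, nu_tilde = beta_ols, 0.0
else:                                        # ||beta(nu)|| decreases in nu, so bisect
    lo, hi = 0.0, 1.0
    while np.linalg.norm(beta_of(hi)) > R:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if np.linalg.norm(beta_of(mid)) > R else (lo, mid)
    nu_tilde = (lo + hi) / 2.0
    beta_tilde = beta_of(nu_tilde)

# nu_tilde plays the role of the lambda in (1) that puts the solution on the ball of radius R
print(nu_tilde, np.linalg.norm(beta_tilde))
```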

Now, if I wanted to constrain the parameters to the ball of radius $R$, how should I choose $\lambda$ in the ridge regression setting (1) so that this holds?

Is there a relationship between $R$ and $\lambda$ in (1) that tells me: if you want your parameters to live in the ball of radius $R$, then choose this $\lambda$? I fail to see it.

My follow-up question is: can one say when the solution $(\hat{\beta}, \lambda)$ of (1) equals the solution $(\tilde{\beta}, \tilde{\nu})$ of (2), given the same data set and a fixed $R$?

reicja
  • I am not sure why you want to go this way, but at least for orthonormal inputs you have the formula $\beta_{\text{ridge}} = \beta/(1+\lambda)$, so you can solve it knowing the original $\beta$ (see page 64 of The Elements of Statistical Learning, PDF freely available online); a quick numeric check of this appears after these comments. – seanv507 Jul 20 '17 at 17:22
  • No statistically motivated reason, I'm just thinking about some geometrical properties. Thank you for the reference. – reicja Jul 20 '17 at 17:54
  • the principles used for lasso in my answer here (https://stats.stackexchange.com/questions/259177/) carry over to ridge – user795305 Jul 21 '17 at 03:45
  • @Ben thank you. I would also ask, can one say when the solution $(\hat{\beta}, \lambda)$ of (1) equals the solution $(\tilde{\beta}, \tilde{\nu})$ of (2), given the same data set and fixed $R$? – reicja Jul 21 '17 at 08:56
  • I'll answer that soon. (I'm travelling a lot lately.) Could you edit your question to include this new question? – user795305 Jul 22 '17 at 15:19
  • As long as the constraint binds, the two formulations are equivalent in the sense that there's a one to one correspondence between $\lambda$ and $R^2$ such that formulation 1 and 2 have same solution. For any value of $R^2$ where the constraint $\|\boldsymbol{\beta}\|_2^2 \leq R^2$ doesn't bind, the corresponding $\lambda $ that gives the same solution is $\lambda = 0$. – Matthew Gunn Jul 24 '17 at 00:05
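
As referenced in seanv507's comment, here is a quick numeric check that with orthonormal columns ($X^\top X = I$) the ridge solution of (1) is the OLS solution shrunk by $1/(1+\lambda)$. The data and $\lambda$ are invented purely for the check.

```python
# Check: for orthonormal columns, beta_ridge = beta_ols / (1 + lam).
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 4))
X, _ = np.linalg.qr(A)              # columns of X are orthonormal, so X'X = I
y = rng.normal(size=100)
lam = 3.0

beta_ols = X.T @ y                  # OLS solution when X'X = I
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(beta_ridge, beta_ols / (1 + lam)))   # True
```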

1 Answer


Define the Lagrangian $$L_C(\beta, \nu) = \frac{1}{2n} \|y-X\beta\|_2^2 + \nu (\|\beta\|_2^2 - C).$$

Notice that the constrained estimator satisfies \begin{align*} \tilde\beta_C & = \arg\min_{\beta \in \mathbb{R}^p \, : \, \|\beta\|_2^2 \leq C} \frac{1}{2n} \|y - X \beta\|_2^2 \\ & = \arg\min_{\beta \in \mathbb{R}^p} \left( \max_{\nu \ge 0} \left\{ \frac{1}{2n} \|y - X \beta\|_2^2 + \nu \left( \|\beta\|_2^2 - C \right) \right\} \right) \\ & = \arg\min_{\beta \in \mathbb{R}^p} \left( \max_{\nu \ge 0} \, L_C(\beta, \nu) \right), \end{align*} since the inner maximum equals $\frac{1}{2n}\|y - X\beta\|_2^2$ when $\|\beta\|_2^2 \leq C$ and is $+\infty$ otherwise.

Also, notice that the penalized estimator satisfies $$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_2^2 = \arg\min_{\beta \in \mathbb{R}^p} L_K(\beta, \lambda), \tag{1}$$ for an arbitrary $K$, since the term $-\lambda K$ does not depend on $\beta$. In particular, this holds for $K=C$.

As explained in my answer at Expressing the LASSO regression constraint via the penalty parameter, we have the relationships $C = \|\hat\beta_\lambda\|_2^2$ and $\lambda = -\frac{\partial \frac{1}{2n} \|y - X \beta\|_2^2}{\partial \|\beta\|_2^2} \Big|_{\tilde\beta_C}$.
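
Here is a small finite-difference check of the second relationship under the $\frac{1}{2n}$ scaling used here, computed along the ridge path; the data and the value of $\lambda$ are arbitrary.

```python
# Check that lambda = -d[(1/2n)||y - X b||^2] / d[||b||^2] along the ridge path.
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(lam):
    # argmin (1/2n)||y - X b||^2 + lam ||b||^2 = (X'X + 2 n lam I)^{-1} X'y
    return np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(p), X.T @ y)

def rss_half_n(b):
    return np.sum((y - X @ b) ** 2) / (2 * n)

lam0, h = 0.5, 1e-4
b_plus, b_minus = ridge(lam0 + h), ridge(lam0 - h)
slope = (rss_half_n(b_plus) - rss_half_n(b_minus)) / (
    np.sum(b_plus ** 2) - np.sum(b_minus ** 2)
)
print(-slope, lam0)   # the two numbers should agree closely
```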

The above formulations of the estimators show that the constrained estimator is the first component of a saddle point of the Lagrangian $L_C$ (this is contrary to your comments about joint minimization following equation $(2)$). From this, we also see that the dual variable $\nu$ that you seek is the other component of the saddle point. We will leverage this interpretation to answer your questions.

We will assume that $C>0$, so that Slater's condition holds, and that $C < \|\beta^\mathrm{OLS}\|_2^2$, so that $C$ and $\tilde\nu_C$ are in bijective correspondence. (When $C = 0$, we have $\tilde\beta_C = 0$ and $\tilde\nu_C = \infty$; when $C \geq \|\beta^\mathrm{OLS}\|_2^2$, we have $\tilde\beta_C = \beta^\mathrm{OLS}$ and $\tilde\nu_C = 0$.)

Since $C>0$, Slater's condition holds, so we know that there exists a saddle point of $L_C$; call it $(\tilde\beta_C, \tilde\nu_C)$. This means that $$\tilde\beta_C \in \arg\min_{\beta\in\mathbb{R}^p} L_C(\beta,\tilde\nu_C), \tag{2}$$ as we mentioned above, and $\tilde\nu_C \in \arg\max_{\nu \ge 0} L_C(\tilde\beta_C, \nu)$. (As an aside, it follows from these two relationships, after some algebra, that strong duality holds. This second relationship can also be used to determine the relationship between $C$ and $\lambda$.)

Since $C < \|\beta^\mathrm{OLS}\|_2^2$, we know that $C$ and $\tilde\nu_C > 0$ are in bijective correspondence. Since the objective function $L_C(\beta, \tilde\nu_C)$ is strongly convex, the estimator is unique for each $C$, and the dual variable $\tilde\nu_C$ is uniquely determined by the estimator $\tilde\beta_C$. Therefore, by $(1)$ and $(2)$, we conclude that $\tilde\nu_C = \lambda$: the penalized estimator with $\lambda = \tilde\nu_C$ is exactly the constrained estimator $\tilde\beta_C$.
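
A sketch that checks this conclusion numerically: fix $\lambda$, compute $\hat\beta_\lambda$, set $C = \|\hat\beta_\lambda\|_2^2$, maximize the dual $g(\nu) = \min_\beta L_C(\beta, \nu)$ over $\nu \geq 0$, and confirm that the maximizer is (approximately) $\lambda$ and that the saddle point recovers $\hat\beta_\lambda$. The data and $\lambda$ are arbitrary, and scipy is used only for the one-dimensional maximization.

```python
# Numerical check that the dual maximizer nu_tilde equals the lambda that
# induced the constraint level C = ||beta_hat_lambda||_2^2.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def beta_of(nu):
    # minimizer over beta of (1/2n)||y - X b||^2 + nu (||b||^2 - C)
    return np.linalg.solve(X.T @ X + 2 * n * nu * np.eye(p), X.T @ y)

lam = 0.3
beta_hat = beta_of(lam)            # penalized estimator for this lambda
C = np.sum(beta_hat ** 2)          # induced constraint level

def neg_dual(nu):
    b = beta_of(nu)
    return -(np.sum((y - X @ b) ** 2) / (2 * n) + nu * (np.sum(b ** 2) - C))

nu_tilde = minimize_scalar(neg_dual, bounds=(0.0, 10.0), method="bounded").x
print(nu_tilde, lam)                                      # should nearly coincide
print(np.linalg.norm(beta_of(nu_tilde) - beta_hat))       # should be near zero
```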

user795305
  • Great answer. Thank you for your time. This is really insightful; reading the linked answer shouldn't be skipped either. – reicja Jul 26 '17 at 17:41