I am choosing parameter vectors $\beta$ and $\nu$ to minimize an expression of the form:
$$-\log{L(Y;X\beta,\nu)}+\frac{1}{2}\lambda {(\beta - \beta_0 )}^{\top} {(\beta - \beta_0 )}$$
where $\lambda$ is a regularization parameter, $\beta_0$ is a fixed vector, and $L(Y;X\beta,\nu)$ is the likelihood of the observation vector $Y$ given $X\beta$ and $\nu$. (The actual likelihood is messy. However, it is the case that $\mathbb{E}Y=X\beta$.)
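For concreteness, here is a minimal sketch of the objective in Python, with a Gaussian likelihood standing in for my actual (messier) likelihood; the function and variable names are illustrative only.

```python
import numpy as np

def penalized_objective(beta, nu, X, Y, beta0, lam, neg_log_likelihood):
    """Negative log-likelihood plus a ridge-style penalty centred at beta0."""
    diff = beta - beta0
    return neg_log_likelihood(Y, X @ beta, nu) + 0.5 * lam * diff @ diff

def gaussian_nll(Y, mean, nu):
    """Gaussian stand-in with variance nu; the real likelihood is messier."""
    return 0.5 * np.sum((Y - mean) ** 2 / nu + np.log(2 * np.pi * nu))
```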
I have to solve many problems of this form. While it is computationally feasible to choose $\lambda$ by K-fold cross-validation on one example problem, it is not feasible to re-optimize $\lambda$ for every different $X$.
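The one-off calibration I have in mind is the usual grid search over folds; a sketch is below, where `fit_penalized` and `held_out_loss` are placeholders for my solver and validation criterion, not part of any particular library.

```python
import numpy as np
from sklearn.model_selection import KFold

def choose_lambda_cv(X, Y, beta0, lam_grid, fit_penalized, held_out_loss, k=5, seed=0):
    """Pick lambda on one example problem by K-fold cross-validation.

    `fit_penalized(X, Y, beta0, lam)` returns (beta_hat, nu_hat);
    `held_out_loss(beta_hat, nu_hat, X_val, Y_val)` scores the held-out fold.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    cv_loss = np.zeros(len(lam_grid))
    for train_idx, val_idx in kf.split(X):
        for j, lam in enumerate(lam_grid):
            beta_hat, nu_hat = fit_penalized(X[train_idx], Y[train_idx], beta0, lam)
            cv_loss[j] += held_out_loss(beta_hat, nu_hat, X[val_idx], Y[val_idx])
    return lam_grid[int(np.argmin(cv_loss))]
```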
How should I scale $\lambda$ as the dimensions $n\times p$ of $X$ vary?
Does it matter that in my particular application I am optimizing subject to the constraints $\beta\ge 0$ (elementwise) and $\beta^\top 1_p = 1$ (where $\beta_0$ also satisfies $\beta_0^\top 1_p = 1$)?
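For reference, the constrained fit is essentially of the following form (a sketch only, again with the Gaussian stand-in and $\nu$ held fixed for simplicity):

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_constrained(X, Y, beta0, lam, nu=1.0):
    """Minimize the penalized objective subject to beta >= 0 and sum(beta) = 1.

    Gaussian stand-in for the likelihood; nu is held fixed here for simplicity.
    """
    p = X.shape[1]

    def objective(beta):
        resid = Y - X @ beta
        nll = 0.5 * np.sum(resid ** 2 / nu + np.log(2 * np.pi * nu))
        diff = beta - beta0
        return nll + 0.5 * lam * diff @ diff

    result = minimize(
        objective,
        x0=np.asarray(beta0, dtype=float),   # beta0 sums to 1, so it is a feasible start
        method="SLSQP",
        bounds=[(0.0, None)] * p,            # beta >= 0 elementwise
        constraints=[{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}],  # beta^T 1_p = 1
    )
    return result.x
```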
An answer to this question (partly clarified below) suggests that for linear regression, it may be optimal to have $\lambda=O_p(p)$ (on the order of $p$, in probability) as $p\rightarrow \infty$. If I've understood correctly, is it reasonable to assume this generalizes to non-Gaussian likelihoods?
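If that scaling does carry over, the practical rule I would apply is just a proportional transfer of the cross-validated value; a hypothetical helper, valid only under the $\lambda=O_p(p)$ heuristic:

```python
def rescale_lambda(lam_ref, p_ref, p_new):
    """Transfer a cross-validated lambda to a new column dimension,
    assuming the optimal lambda grows proportionally to p."""
    return lam_ref * (p_new / p_ref)
```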