
Imagine an underdetermined linear system with N (continuous) labels and N samples, each with P features (N < P):

$$\hat{\textbf{Y}}_{N \times 1} = \textbf{X}_{N \times P} \textbf{W}_{P\times 1} $$

and we are interested in finding the best weight vector $\textbf{W}$ for this regression problem.

Since the system is underdetermined, a linear regression model has infinitely many feasible solutions, and it is customary to pick the one that also minimizes the norm $||\textbf{W}||^2$. This indeed ensures that most unimportant features possess negligible weights.
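To make the minimum-norm solution concrete, here is a small NumPy sketch on synthetic Gaussian data (the data, seed, and dimensions are my own illustration, not from the question). The Moore-Penrose pseudoinverse yields the interpolating solution of smallest norm; adding any null-space direction still fits the data but increases the norm:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 50                       # underdetermined: fewer samples than features
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

# Minimum-norm solution via the Moore-Penrose pseudoinverse
w_min = np.linalg.pinv(X) @ y

# It interpolates the data exactly...
print(np.allclose(X @ w_min, y))    # True

# ...and any other interpolating solution has a larger norm:
# perturb w_min by a right singular vector in the null space of X
w_other = w_min + np.linalg.svd(X)[2][-1]
print(np.allclose(X @ w_other, y))  # True (still fits)
print(np.linalg.norm(w_other) > np.linalg.norm(w_min))  # True (larger norm)
```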

Another approach to this problem is ridge regression. There, too, a hyperparameter lets one control the norm of the solution.
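For reference, a minimal sketch of the ridge solution in the N < P regime (synthetic data of my own, not from the question). When P > N it is cheaper to use the "dual" form, which inverts an N×N matrix instead of the P×P matrix of the usual closed form; the two are algebraically identical:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 20, 50
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
lam = 0.5  # ridge regularization strength

# Dual form: inverts the N x N Gram matrix X X'
w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

# Primal closed form: inverts the P x P matrix X'X + lam I
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

print(np.allclose(w_ridge, w_primal))  # True
```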

However, when it comes to prediction, ridge apparently performs better in practice (the last comment here, as well as my own experience training such a system, confirm this statement). I'm interested to know why that is the case.

arash
    Min-norm solution is the limit of ridge with regularization parameter going to 0. If 0 is not the optimal value for your data, then tuning the regularization parameter can yield better results. – amoeba Nov 20 '19 at 13:58
  • Related: https://stats.stackexchange.com/questions/328630. – amoeba Nov 20 '19 at 14:00
  • @amoebasaysReinstateMonica Thanks for the very informative question you addressed. I'm still digesting that thread, though, and it will surely take me a while to fully understand the details. – arash Nov 22 '19 at 08:49
  • You might want to read this https://arxiv.org/abs/1805.10939 instead of reading that thread... It's a write-up of that whole discussion. – amoeba Nov 22 '19 at 08:56
  • @amoebasaysReinstateMonica I surely do! I really appreciate it. – arash Nov 22 '19 at 09:03
  • @amoebasaysReinstateMonica I'm quite puzzled! The KKT conditions enforce the $\lambda$ in ridge and lasso to be positive. Where do you address that in the paper? (BTW, in figure 2, there's a typo in the explanation of subfigure e; in both cases, p is larger than n.) – arash Dec 11 '19 at 09:55
  • Hmm. How exactly does it enforce it? You can see here https://stats.stackexchange.com/questions/331264 a related discussion of negative ridge (in the standard n>p regime). I think everything still works, at least for small negative values of lambda. That said, looking at the preprint, the ridge estimator defined in Eq (3) does not make sense for negative values of lambda when n – amoeba Dec 11 '19 at 10:41
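The first comment's point, that the min-norm solution is the limit of ridge as the regularization parameter goes to 0, can be verified numerically. A sketch on synthetic data (my own illustration), using the dual-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 10, 30
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

# Minimum-norm interpolating solution
w_min = np.linalg.pinv(X) @ y

def ridge(lam):
    # dual-form ridge solution, efficient when P > N
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

# The gap to the min-norm solution shrinks as lambda -> 0
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    print(lam, np.linalg.norm(ridge(lam) - w_min))
```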

0 Answers