
Recall that the pseudo-inverse solution can be characterized as the solution to the following constrained optimization problem:

Minimize $$ \| w \|^2 $$

subject to:

$$ Xw = y $$

Since it is a constrained optimization problem, it is plausible that the solution generalizes (in line with the traditional wisdom of statistical learning theory). If that is the case, does it generalize in the following sense (empirical risk converges to expected risk):

$$ \lim_{ N \rightarrow \infty} \left( \hat E_{S_N}[Loss(f_{w^+}(x),y) ] - E_{p(x,y)}[Loss(f_{w^+}(x),y) ] \right) = 0$$

where $f_{w^+}(x) = \langle w^+, x \rangle$, $S_{N} = \{ (x_i,y_i) \}^N_{i=1}$ and $\hat E_{S_N}[Loss(f(x),y)] = \frac{1}{N} \sum^N_{i=1} Loss(f(x_i),y_i) $.
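For concreteness, here is a minimal numpy sketch of this characterization (the overparametrized Gaussian design below is made up purely for illustration): np.linalg.pinv returns a solution that interpolates the data, and adding any null-space component to it produces another interpolant with a strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 50                      # under-constrained: more columns than rows
X = rng.standard_normal((N, D))    # full row rank with probability 1
y = rng.standard_normal(N)

w_plus = np.linalg.pinv(X) @ y     # w^+ = X^+ y

# w^+ interpolates the data: ||X w^+ - y|| is (numerically) zero
print(np.linalg.norm(X @ w_plus - y))

# any other interpolant differs from w^+ by a vector in the null space of X
v = rng.standard_normal(D)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # project v onto null(X)
w_other = w_plus + v_null
print(np.linalg.norm(X @ w_other - y))     # also (numerically) zero
print(np.linalg.norm(w_plus), np.linalg.norm(w_other))   # w^+ has the smaller norm
```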


My thoughts:

Assume we find the predictor/hypothesis we want via empirical risk minimization (by training on the training set), i.e. by minimizing:

$$ \| Xw - y \|^2 $$

and the solution we find is the pseudo-inverse one, and thus it is the minimum-norm solution, i.e. $w^+ = X^+ y$ (assume $X$ has full row rank and that the system is under-constrained/overparametrized). If we solved a similar problem but used Tikhonov regularization instead, we would have generalization via stability. In that case we would be solving:

$$ \| Xw - y \|^2 + \lambda \| w\|^2 $$

so, assuming $X$ has full row rank, then of course the first term $\| Xw - y \|^2 = 0$, because we can find a linear combination that makes it zero. Now we need to find a solution that minimizes $\|w\|^2$, the L2 norm of the solution. My confusion is whether Tikhonov regularization will find the same minimum-norm solution as the pseudo-inverse. Since both minimize the squared distance to the target $y$, it should be clear that it is possible to make that term zero. However, it is not clear whether the solutions coincide. My guess is that Tikhonov does not coincide with the pseudo-inverse solution unless $\lambda = 0$. If that is the case, then it is not clear at all whether the stability argument carries over to the pseudo-inverse solution, and therefore it is unclear whether it has any generalization guarantees.
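To probe that confusion numerically, here is a small sketch (the data and the grid of $\lambda$ values are made up for illustration): it compares the ridge solution $w_{\lambda} = (X^TX + \lambda I)^{-1}X^Ty$ with $w^+ = X^+ y$. In runs like this the two differ for every fixed $\lambda > 0$, and the ridge residual is nonzero (as a comment below also points out), while the gap appears to shrink as $\lambda \rightarrow 0^+$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 10, 50
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

w_plus = np.linalg.pinv(X) @ y     # minimum-norm interpolating solution

def ridge(X, y, lam):
    """Tikhonov/ridge solution w_lambda = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    w_lam = ridge(X, y, lam)
    print(lam,
          np.linalg.norm(w_lam - w_plus),   # distance to the pseudo-inverse solution
          np.linalg.norm(X @ w_lam - y))    # ridge residual
```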


Tikhonov vs Pseudo-inverse solutions

Looking at the two closed-form expressions for $w$, it seems to me that the Tikhonov solution is different from the pseudo-inverse solution:

$$ w^+ = (X^TX)^{+}X^Ty $$

vs. Tikhonov:

$$ w_{\lambda} = (X^TX + \lambda I )^{-1}X^Ty $$

Since they don't look equal, I'd assume that the pseudo-inverse solution might not generalize (at least not via the same stability argument).
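To make the comparison precise (a standard SVD computation, added here for illustration): writing $X = U \Sigma V^T$ with singular values $\sigma_i$, left singular vectors $u_i$ and right singular vectors $v_i$, both solutions filter $y$ through the same singular vectors,

$$ w_{\lambda} = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda} \langle u_i, y \rangle \, v_i, \qquad w^+ = \sum_{i:\, \sigma_i > 0} \frac{1}{\sigma_i} \langle u_i, y \rangle \, v_i, $$

so for any fixed $\lambda > 0$ the two do not coincide, but $w_{\lambda} \rightarrow w^+$ as $\lambda \rightarrow 0^+$, since the filter factor $\sigma_i/(\sigma_i^2 + \lambda)$ tends to $1/\sigma_i$ when $\sigma_i > 0$ and stays $0$ when $\sigma_i = 0$.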


Note: generalization in this question is defined as the training (empirical) risk and the test/expected risk converging to the same value as the number of data points $N$ goes to infinity.
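For what it's worth, here is a rough Monte Carlo sketch of that definition for the minimum-norm solution (the Gaussian data model, the fixed dimension $D = 50$, and the noise level are assumptions made purely for illustration; note also that once $N > D$ the system is no longer under-constrained, a point that comes up in the comments below). It estimates the gap between the empirical risk and the expected risk, the latter approximated on a large held-out sample.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 50                                   # fixed input dimension (illustrative choice)
w_true = rng.standard_normal(D)          # assumed "true" linear signal
noise = 0.1

def sample(n):
    X = rng.standard_normal((n, D))
    y = X @ w_true + noise * rng.standard_normal(n)
    return X, y

def sq_risk(w, X, y):
    return np.mean((X @ w - y) ** 2)

X_test, y_test = sample(100_000)         # large held-out set approximates the expected risk

for N in [10, 25, 50, 100, 1000, 10_000]:
    X_train, y_train = sample(N)
    w_plus = np.linalg.pinv(X_train) @ y_train   # minimum-norm least-squares solution
    gap = sq_risk(w_plus, X_train, y_train) - sq_risk(w_plus, X_test, y_test)
    print(N, gap)   # empirical risk minus (estimated) expected risk
```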

  • You're mixing up linear regression with solving linear systems - in the case of solving a linear system there is a fixed problem, so talking about generalization doesn't really make sense – Jakub Bartczuk Feb 25 '18 at 14:20
  • @JakubBartczuk no, I'm not confusing anything. You're not realizing that $Xw = y$ leads to zero error $ \| Xw - y \|^2 $, of course. It of course makes sense to talk about generalization when there is a test set, in the classical statistical learning setting. My question did need some re-phrasing, though, so as not to confuse people without the proper background who were not seeing the connections. – Charlie Parker Feb 27 '18 at 20:38
  • Ok, it makes more sense now. – Jakub Bartczuk Feb 27 '18 at 20:57
  • also, there was a similar question recently; see the answers here: https://stats.stackexchange.com/a/329903/36041 – Aksakal Feb 27 '18 at 21:07
  • You wrote "so, assuming $X$ has full row rank, then of course the first term $\| Xw - y \|^2 = 0$, because we can find a linear combination that makes it zero. Now we need to find a solution that minimizes $\|w\|^2$, the L2 norm of the solution." There's no reason this must be true. The optimal $w$ might have a non-zero first term $\| Xw - y \|^2$, since a decrease in $\lambda\|w\|^2$ might more than offset some increase in $\| Xw - y \|^2$. – Mark L. Stone Feb 27 '18 at 21:09
  • @MarkL.Stone yes, that could be right (probably is). I guess I'm obsessing over whether the pseudo-inverse generalizes or not. I want to know if the pseudo-inverse generalizes (since it seems to be solving a constrained optimization problem "implicitly"). – Charlie Parker Feb 27 '18 at 21:13
  • @Aksakal they are not. One could have too many parameters (say columns) and not span the whole row space. I hope it didn't seem I was equating them. Btw, thanks for the link. – Charlie Parker Feb 27 '18 at 21:14
  • Ridge regression with $\lambda > 0$ does not ever correspond to a solution with squared error 0. This is because it is equivalent to minimizing squared error on a constrained set of parameters (depending on $\lambda$) which does not contain points where squared error is $0$. – Jonny Lomond Feb 27 '18 at 21:20
  • @JonnyLomond Sure it can: let all the labels $y = 0$, then no matter the $\lambda$ using $w = 0$ will get $\lVert X w - y \rVert = 0$. I *think* this is the only case where it can happen, but not sure. – Danica Feb 27 '18 at 21:26
  • @Aksakal what you're missing is that I am assuming (for convenience; we don't need this assumption) that it's full **row** rank. Yes, we do have many columns. – Charlie Parker Feb 27 '18 at 21:29
  • If your sample size grows, why would you use the pseudo-inverse? Maybe it's a different question, but why wouldn't you then use OLS? – Aksakal Feb 27 '18 at 21:34
  • Did you really ask this without first seeing the discussion in the thread linked above? I find this hard to believe. Anyway, I think the accepted answer in that thread answers your question. – amoeba Feb 27 '18 at 21:41
  • @Aksakal 1) I don't know what OLS is. 2) I am not trying to solve anything; I am trying to theoretically characterize the generalization (or not) properties of the minimum-norm solution (since it's a solution that seems to be constrained, so classical statistical learning intuition would suggest it does generalize as $N$ goes to infinity). Letting $N$ go to infinity is just the way generalization is defined: $ \lim_{N \rightarrow \infty} \hat E_{S_N}[l(f(x),y)] - E_{p(x,y)}[l(f(x),y)] = 0$. – Charlie Parker Feb 27 '18 at 21:41
  • @amoeba I have just started reading that other post, which seems to have a lot of activity. I did not see that post when I originally posted my question (btw, I am not looking for an empirical answer, I am looking for a theoretical answer in the context of statistical learning theory, things like VC dimension, stability, etc.). – Charlie Parker Feb 27 '18 at 21:42
  • Then it's an amazing synchrony! – amoeba Feb 27 '18 at 21:43
  • @amoeba lol I know. XD Though the other answer/question is written in a more traditional statistics setting, with things like $R^2$, which makes it a little hard to read for me... I will do my best... – Charlie Parker Feb 27 '18 at 21:43
  • @Pinocchio, if you have a large sample, the pseudo-inverse will render exactly the same solution as least squares, see e.g. Section 5 in http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-macausland-pseudo-inverse.pdf; that's why it doesn't make sense to use it when you have more rows than columns, for instance, where OLS works just fine – Aksakal Feb 27 '18 at 21:49
  • @alex-r just answered your question regarding the equality 'Tikhonov vs minimum norm', which is correct when you consider the *limit* of the Tikhonov regularization. But you had an additional question regarding the stability and coincidence of solutions? Could you elaborate on that? What do you mean by coinciding solutions? Of course Tikhonov $\neq$ minimal norm, no? – Sextus Empiricus Feb 27 '18 at 21:50
  • @Aksakal that is a 10-page paper; could you point me to the part you want me to read? I am aware that least squares and the pseudo-inverse are equivalent. My question is about the expected risk and empirical risk converging or not. – Charlie Parker Feb 27 '18 at 21:51
  • @MartijnWeterings now that I reflect on it, it might have been a mistake to mention Tikhonov at all (it was sort of to provide context/intuition). My question is really simple: does the pseudo-inverse have the property that the empirical and expected risk converge as $N$ goes to infinity or not? I am just interested in knowing if the pseudo-inverse has any generalization properties in the statistical learning setting. – Charlie Parker Feb 27 '18 at 21:53
  • @Aksakal wrote "large sample pseudo-inverse will render exactly the same solution as least squares ... that's why it doesn't make sense to use it when you have more rows than columns, for instance, where OLS works just fine". The pseudo-inverse, as calculated with the SVD, as in MATLAB's pinv, is a numerically stable way to compute OLS, and a perfectly fine thing to do, memory and computation time permitting. x = pinv(A) * b. Simple and numerically robust. – Mark L. Stone Feb 27 '18 at 22:10
  • I wonder if this idea of generalization is suitable for the minimum-norm solution. How does $n$, the number of rows, go to infinity when the matrix is considered full row rank? For the Tikhonov regularization this is not an issue. – Sextus Empiricus Feb 27 '18 at 23:12
  • @MartijnWeterings why is it not an issue when the # of rows goes to infinity for Tikhonov? Why is it a problem here? – Charlie Parker Feb 28 '18 at 03:14
  • @MarkL.Stone sorry Mark, I am not sure what your point is. The question has nothing to do with any real computations; it's a totally theoretical question (with limits). Yes, I am aware that it's the solution to least squares. But that's not the issue; the issue is that there are an infinite # of solutions, many of which you can increase the norm of arbitrarily and get an arbitrarily bad expected error. Thus, if the solution is actually bounded due to the "minimum normness", then I think it's an interesting property. – Charlie Parker Feb 28 '18 at 03:17
  • @Pinocchio My 2nd comment was a response to the comment by Aksakal, and was tangential to your question. – Mark L. Stone Feb 28 '18 at 03:36
  • @MarkL.Stone oh yeah, you're right, now I see the @ sign addressed at Aksakal. Sorry! – Charlie Parker Feb 28 '18 at 03:47
  • @Pinocchio If you let N go to infinity, or already when the number of rows exceeds the number of columns, then Xw=y becomes ill-posed. This is not present in the Tikhonov regularization. – Sextus Empiricus Feb 28 '18 at 08:27
  • @MartijnWeterings sorry if it's supposed to be obvious, but do you have a reference for that? I'm not sure what you mean by it becoming ill-posed. It's ill-posed from the start. – Charlie Parker Feb 28 '18 at 18:20
  • You impose that Xw=y has solutions; this breaks down when N becomes larger than the number of columns. – Sextus Empiricus Feb 28 '18 at 18:58
  • @MartijnWeterings that is very interesting! I did not notice that. I guess then that, for sufficiently large N, the pseudo-inverse is just unregularized least squares (fitting the data without bias), so it does NOT regularize as $N$ goes to infinity. I guess that makes sense. I wonder then what type of statement can be made, now that you made me realize that the rules of the game have to be changed. Regardless, it is still minimum norm, so what type of statistical learning statement can we make? – Charlie Parker Mar 01 '18 at 14:11
  • I guess you can still go for stability and consistency. The previous link to a question by amoeba shows a discussion in which it is shown that when rows – Sextus Empiricus Mar 01 '18 at 15:29
  • An interesting concept comes to mind. **Ridge regression:** 1) constrain the minimal norm with a parameter and 2) find the least squares solution. What about the alternative obtained by switching these two? **Minimal norm regression:** 1) constrain the sum of squared residuals and 2) find the minimum norm. – Sextus Empiricus Mar 01 '18 at 15:32
  • @MartijnWeterings your second suggestion sounds extremely similar to Ridge Regression. I'm not sure if it's the same as the question I am asking (but still interesting!). What I'm most curious about is the overparametrized case |N| – Charlie Parker Mar 01 '18 at 16:15

0 Answers