Recall that the pseudo-inverse solution can be characterized as the solution of the following constrained optimization problem:
Minimize $$ \| w \|^2 $$
subject to:
$$ Xw = y $$
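For concreteness, here is a short Lagrange-multiplier derivation of the closed form of this constrained minimizer (my own sketch, under the assumption that $X$ has full row rank so that $XX^T$ is invertible):
$$ \mathcal{L}(w, \alpha) = \| w \|^2 + \alpha^T (Xw - y), \qquad \nabla_w \mathcal{L} = 2w + X^T \alpha = 0 \;\Rightarrow\; w = -\tfrac{1}{2} X^T \alpha $$
$$ Xw = y \;\Rightarrow\; -\tfrac{1}{2} X X^T \alpha = y \;\Rightarrow\; w^+ = X^T (XX^T)^{-1} y = X^+ y $$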
Thus, since it is a constrained optimization problem, it is plausible that the solution generalizes (in line with the traditional wisdom of statistical learning theory). If that is the case, does it generalize in the following sense (empirical risk converges to expected risk):
$$ \lim_{ N \rightarrow \infty} \hat E_{S_N}[Loss(f_{w^+}(x),y) ] - E_{p(x,y)}[Loss(f_{w^+}(x),y) ] = 0$$
where $f_{w^+}(x) = \langle w^+, x \rangle$, $S_{N} = \{ (x_i,y_i) \}^N_{i=1}$ and $\hat E_{S_N}[Loss(f(x),y)] = \frac{1}{N} \sum^N_{i=1} Loss(f(x_i),y_i) $.
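As a sanity check on this definition (just a numerical sketch, not a proof of anything), the gap between the empirical risk and a Monte Carlo estimate of the expected risk of the minimum-norm solution can be probed on synthetic data; the linear ground truth, noise level, and dimensions below are arbitrary assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                       # feature dimension (arbitrary choice)
w_true = rng.normal(size=d)  # assumed ground-truth linear signal
sigma = 0.1                  # assumed label noise level

def sample(n):
    """Draw n i.i.d. pairs (x, y) with y = <w_true, x> + Gaussian noise."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + sigma * rng.normal(size=n)
    return X, y

for N in [10, 25, 40, 100, 1000]:
    X, y = sample(N)
    w_plus = np.linalg.pinv(X) @ y      # w^+ = X^+ y, the minimum-norm ERM solution
    emp_risk = np.mean((X @ w_plus - y) ** 2)
    X_big, y_big = sample(100_000)      # fresh sample to approximate the expected risk
    exp_risk = np.mean((X_big @ w_plus - y_big) ** 2)
    print(f"N={N:5d}  empirical risk={emp_risk:.4f}  expected risk~{exp_risk:.4f}")
```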
My thoughts:
Assume we find the predictor/hypothesis we want via empirical risk minimization (by training on the training set):
$$ \| Xw - y \|^2 $$
and the solution we find is the pseudo-inverse solution, and thus it has minimum norm, i.e. $w^+ = X^+ y$ (assume $X$ has full row rank and that the system is under-constrained/overparametrized). Suppose instead we solved a similar problem but used Tikhonov regularization (in which case we would have generalization via stability). In that case we are solving:
$$ \| Xw - y \|^2 + \lambda \| w\|^2 $$
So, assuming $X$ has full row rank, the first term $\| Xw - y \|^2$ can be driven to zero, because we can find a linear combination of the columns of $X$ that matches $y$ exactly. We then need to find, among such solutions, one that minimizes the L2 norm $\| w \|^2$. My confusion is whether Tikhonov regularization finds the same minimum-norm solution as the pseudo-inverse. Since both objectives penalize the squared distance to the targets $y$, it is clear that this distance can be made zero. However, it is not clear that the two solutions coincide. My guess is that the Tikhonov solution does not coincide with the pseudo-inverse solution unless $\lambda = 0$. If that is the case, then it is not clear at all whether the stability argument carries over to the pseudo-inverse solution, and therefore it is unclear whether it has any generalization guarantees.
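For reference, here is the standard first-order condition for the Tikhonov objective (my own short derivation, which is where the closed form below comes from):
$$ \nabla_w \left( \| Xw - y \|^2 + \lambda \| w \|^2 \right) = 2 X^T (Xw - y) + 2 \lambda w = 0 \;\Rightarrow\; (X^T X + \lambda I) w_{\lambda} = X^T y $$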
Tikhonov vs Pseudo-inverse solutions
Looking at the two equations for $w$, it seems to me that the Tikhonov solution is different from the pseudo-inverse solution:
$$ w^+ = (X^TX)^{+}X^Ty $$
vs Tikhonov:
$$ w_{\lambda} = (X^TX + \lambda I )^{-1}X^Ty $$
Since they don't look equal, I'd assume that the pseudo-inverse solution might not generalize.
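To probe this numerically (again just a sketch on random data of my own making, not an argument), one can compare $w^+ = X^+ y$ with $w_{\lambda}$ for shrinking values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 50                    # under-constrained: fewer equations than unknowns (arbitrary sizes)
X = rng.normal(size=(N, d))      # full row rank with probability 1
y = rng.normal(size=N)

w_plus = np.linalg.pinv(X) @ y   # minimum-norm interpolating solution w^+ = X^+ y

for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    # Tikhonov/ridge solution w_lambda = (X^T X + lambda I)^{-1} X^T y
    w_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lambda={lam:8.0e}  ||w_lam - w_plus|| = {np.linalg.norm(w_lam - w_plus):.2e}")
```

If the gap shrinks as $\lambda \to 0^+$, the two solutions would agree in the limit, but whether the stability argument for $\lambda > 0$ then carries over to the $\lambda = 0$ minimum-norm solution is exactly what I am asking.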
Note: generalization in this question is defined as the train (empirical) and test (expected) risk converging to the same value as the number of data points goes to infinity.