
In Bishop's *Pattern Recognition and Machine Learning*, Section 3.1.2 (Geometry of least squares, page 143, last paragraph of the section), it is stated that:

In practice, a direct solution of the normal equations can lead to numerical difficulties when Φ^TΦ is close to singular. In particular, when two or more of the basis vectors ϕ_j are co-linear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of singular value decomposition, or SVD (Press et al., 1992; Bishop and Nabney, 2008). Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

I do not understand how adding a regularization term can prevent the Φ^TΦ matrix from being singular. After all, it is just a sum: λI + Φ^TΦ instead of Φ^TΦ (I: the identity matrix).

In particular, this part confuses me:

Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

A PDF of the book is available here: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
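
To make the degeneracy concrete, here is a small NumPy sketch of my own (the design matrix is made up, not from the book): when two basis columns are co-linear, Φ^TΦ is rank-deficient and therefore singular.

```python
import numpy as np

# Made-up design matrix in which the second and third basis columns are identical,
# i.e. two of the basis vectors are exactly co-linear.
Phi = np.array([[1.0, 2.0, 2.0],
                [1.0, 3.0, 3.0],
                [1.0, 5.0, 5.0],
                [1.0, 7.0, 7.0]])

M = Phi.T @ Phi
print(np.linalg.matrix_rank(M))  # 2 (not 3), so Phi^T Phi is singular
```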

Mas A
  • Here's a sketch: Any $M=\Phi^T \Phi$ (for real $\Phi$) must be positive *semi*-definite (sums of squares must be non-negative). Adding $\lambda I$ (for $\lambda > 0$) guarantees that the singular values must be positive (the sum of a positive number and a non-negative number is positive). You can demonstrate this by writing down the SVD of $\Phi$ and working through the algebra of $M + \lambda I$. – Sycorax Jan 07 '22 at 15:40
  • @Sycorax That machinery is overkill. In this context $A=\Phi^\prime\Phi$ *must* be square and positive semi-definite. (Otherwise the conclusion is incorrect.) Now, $\lambda I+A$ is singular for a positive real number $\lambda$ if and only if there exists a nonzero $x$ for which $(\lambda I + A)x=0$ (that's the definition of "singular"). That's obviously equivalent to the eigenvalue equation $Ax=-\lambda x,$ yet we know (by supposition) that $A$ has no negative eigenvalues, *QED.* – whuber Jan 07 '22 at 18:26
  • @whuber That’s a much better demonstration! – Sycorax Jan 07 '22 at 18:31
  • https://stats.stackexchange.com/questions/69205/ is closely related: it discusses Ridge Regression, which is an example of this form of regularization. – whuber Jan 07 '22 at 18:59
  • How could I not think of that, of course, it makes sense! Thank you so much for your explanations @Sycorax – Mas A Jan 07 '22 at 20:46
  • Thank you @whuber for your comments and the url that you shared! – Mas A Jan 07 '22 at 20:48

3 Answers


This demonstration starts from the definition of positive (semi) definite. We can use the definition of the $L^2$ norm to show that the quadratic form must be non-negative.

$$ x^T \Phi^T \Phi x = \| \Phi x \|_2^2 \ge 0 $$

The intuition here is that the shortest length the vector $\Phi x$ can have is 0, which matches the usual notion of length. Put generically, we are summing squares of real numbers (that is what the squared $L^2$ norm is), so the result must be non-negative.

So now we know that $\Phi^T \Phi$ is positive semi-definite. To answer the question, we need to further show that $M = \Phi^T \Phi + \lambda I$ is nonsingular for $\lambda > 0$.

We can use the same strategy again:

$$\begin{align} x^T \left( \Phi^T\Phi + \lambda I\right) x &= x^T \Phi^T\Phi x + \lambda x^T x \\ &= \| \Phi x \|_2^2 + \lambda \| x \|_2^2 \gt 0 \end{align}$$

We know that this quadratic form must be positive for any nonzero vector $x$: the first term is non-negative, and the second term $\lambda \| x \|_2^2$ is strictly positive because the length of a nonzero vector is positive. The sum of a non-negative and a positive number is positive.

Therefore, the regularized matrix $M$ satisfies the definition of positive definite. A matrix that is positive definite has all positive eigenvalues, which means that 0 is not an eigenvalue. By the invertible matrix theorem, $M$ is nonsingular.


This demonstration shows why $M$ is nonsingular for any $\lambda > 0$. In other words, the choice of $\lambda$ does not depend on $\Phi$. We know that this is the case because we've shown that all eigenvalues of $\Phi^T \Phi$ are non-negative, which is more specific than showing that all eigenvalues are real.
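
As a numerical sanity check (a minimal sketch with made-up numbers, not part of the derivation above), we can build a $\Phi$ with an exactly co-linear column and confirm that $\Phi^T\Phi$ has a zero eigenvalue while $\Phi^T\Phi + \lambda I$ has only positive eigenvalues:

```python
import numpy as np

# Minimal sketch: Phi, its duplicated column, and lambda are all illustrative choices.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))
Phi = np.column_stack([Phi, Phi[:, 0]])  # duplicate a column: exact co-linearity

M = Phi.T @ Phi                          # symmetric positive semi-definite, but singular
lam = 1e-3

print(np.linalg.eigvalsh(M).min())                    # ~0: a zero eigenvalue, so M is singular
print(np.linalg.eigvalsh(M + lam * np.eye(4)).min())  # ~lam > 0: positive definite, hence invertible
```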

Sycorax

Given the formulation:

$$ \frac{1}{2} {\left\| A x - b \right\|}_{2}^{2} + \frac{\lambda}{2} {\left\| x \right\|}_{2}^{2} $$

The optimal solution is given by $ \hat{x} = {\left( {A}^{T} A + \lambda I \right)}^{-1} {A}^{T} b $.

Since $ {A}^{T} A $ is symmetric, all of its eigenvalues are real.
It is also positive semi-definite, hence all of its eigenvalues are non-negative. Therefore, for $ \lambda > 0 $, the matrix $ {A}^{T} A + \lambda I $ has all positive eigenvalues, which means it is invertible (and numerically stable) since it is a positive definite matrix.
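
Here is a minimal numerical sketch of this closed form (the matrix $A$, vector $b$, and value of $\lambda$ below are made up): even when the columns of $A$ are co-linear, so that ${A}^{T} A$ is singular, the regularized system has full rank and can be solved directly.

```python
import numpy as np

# Illustrative ridge solution; A, b, and lam are made-up examples.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))
A[:, 3] = A[:, 1] + A[:, 2]      # exact co-linearity, so A^T A is rank-deficient
b = rng.normal(size=20)
lam = 0.1

print(np.linalg.matrix_rank(A))                          # 3 < 4: A^T A alone is singular
print(np.linalg.matrix_rank(A.T @ A + lam * np.eye(4)))  # 4: full rank once lambda > 0

# Solve the regularized normal equations (A^T A + lam I) x = A^T b
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)
print(x_hat)
```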

Eric Johnson
  • The point is that "large enough" is *any positive value whatsoever.* The information you are neglecting is that $A^\prime A$ obviously has no negative eigenvalues. – whuber Jan 10 '22 at 19:09

Without getting into linear algebra, the intuition is actually very simple. The difference between $$(\Phi'\Phi+\lambda I)^{-1}\\(\Phi'\Phi)^{-1}$$ is like the difference between $$\frac 1 {x^2+\lambda}\\\frac 1 {x^2}$$ where $x\in\mathbb R$: $$x^2\in[0,\infty)\\x^2+\lambda\in[\lambda,\infty)$$ Hence, the regularized version doesn't have a singularity at $x=0$ when $\lambda\gt 0$.
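
A tiny numeric version of the analogy (the eigenvalues below are made up): the non-negative eigenvalues of $\Phi'\Phi$ play the role of $x^2$, and adding $\lambda$ pushes them all away from zero.

```python
import numpy as np

# Made-up eigenvalues of Phi'Phi: non-negative, possibly exactly zero.
eigs = np.array([0.0, 0.5, 2.0])
lam = 0.1

with np.errstate(divide="ignore"):
    print(1 / eigs)        # [inf 2.  0.5]: the zero eigenvalue is the singularity
print(1 / (eigs + lam))    # all finite once lambda > 0
```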

Aksakal
  • You obviously have to do linear algebra, but the analogy, again, is simple: it corresponds to $\frac 1 {x^2+\lambda}$. When $\lambda>0$, there's no singularity. – Aksakal Jan 10 '22 at 21:36
  • By using suitable definitions we can avoid linear algebra altogether (or at least keep it at the high school level). For any square matrix $A$ (symmetric or not), if there is a number $\lambda\gt 0$ for which $A+\lambda I$ is singular, then, by definition, there would exist a nonzero vector $x$ for which $(A+\lambda I)x=0.$ Upon left-multiplying both sides by $x^\prime$ we would conclude $x^\prime Ax=-\lambda|x|^2\lt0.$ That is impossible, because in this context $x^\prime Ax$ represents a sum of squares, which cannot be negative. – whuber Jan 10 '22 at 21:48