
In Bishop's *Pattern Recognition and Machine Learning*, Section 3.1.2 (Geometry of least squares, page 143, last paragraph of the section), it is stated that:

In practice, a direct solution of the normal equations can lead to numerical difficulties when Φ^TΦ is close to singular. In particular, when two or more of the basis vectors ϕ_j are co-linear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of singular value decomposition, or SVD (Press et al., 1992; Bishop and Nabney, 2008). Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

I do not understand how adding a regularization term can prevent the Φ^TΦ matrix from being singular. After all, it is just a sum: λI + Φ^TΦ instead of Φ^TΦ (I: the identity matrix).

In particular, this part confuses me:

Note that the addition of a regularization term ensures that the matrix is nonsingular, even in the presence of degeneracies.

A PDF of the book is available here: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
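
To make the degeneracy concrete, here is a small NumPy sketch of my own (the design matrix is made up, not from the book): when two basis columns are co-linear, Φ^TΦ is rank-deficient and therefore singular.

```python
import numpy as np

# Made-up design matrix in which the second and third basis columns are identical,
# i.e. two of the basis vectors are exactly co-linear.
Phi = np.array([[1.0, 2.0, 2.0],
                [1.0, 3.0, 3.0],
                [1.0, 5.0, 5.0],
                [1.0, 7.0, 7.0]])

M = Phi.T @ Phi
print(np.linalg.matrix_rank(M))  # 2 (not 3), so Phi^T Phi is singular
```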

Mas A
  • Here's a sketch: Any $M=\Phi^T \Phi$ (for real $\Phi$) must be positive *semi*-definite (sums of squares must be non-negative). Adding $\lambda I$ (for $\lambda > 0$) guarantees that the singular values must be positive (the sum of a positive number and a non-negative number is positive). You can demonstrate this by writing down the SVD of $\Phi$ and working through the algebra of $M + \lambda I$. – Sycorax Jan 07 '22 at 15:40
  • @Sycorax That machinery is overkill. In this context $A=\Phi^\prime\Phi$ *must* be square and positive semi-definite. (Otherwise the conclusion is incorrect.) Now, $\lambda I+A$ is singular for a positive real number $\lambda$ if and only if there exists a nonzero $x$ for which $(\lambda I + A)x=0$ (that's the definition of "singular"). That's obviously equivalent to the eigenvalue equation $Ax=-\lambda x,$ yet we know (by supposition) that $A$ has no negative eigenvalues, *QED.* – whuber Jan 07 '22 at 18:26
  • @whuber That’s a much better demonstration! – Sycorax Jan 07 '22 at 18:31
  • https://stats.stackexchange.com/questions/69205/ is closely related: it discusses Ridge Regression, which is an example of this form of regularization. – whuber Jan 07 '22 at 18:59
  • How could I not think of that, of course, it makes sense! Thank you so much for your explanations @Sycorax – Mas A Jan 07 '22 at 20:46
  • Thank you @whuber for your comments and the url that you shared! – Mas A Jan 07 '22 at 20:48

3 Answers


This demonstration starts from the definition of positive (semi) definite. We can use the definition of the $L^2$ norm to show that the quadratic form must be non-negative.

$$ x^T \Phi^T \Phi x = \| \Phi x \|_2^2 \ge 0 $$

The intuition here is that the shortest length the vector $\Phi x$ can have is 0, which matches the usual notion of length. Put generically, we are summing squares of real numbers (that is what the squared $L^2$ norm is), so the result must be non-negative.

So now we know that $\Phi^T \Phi$ is positive semi-definite. To answer the question, we need to further show that $M = \Phi^T \Phi + \lambda I$ is nonsingular for $\lambda > 0$.

We can use the same strategy again:

$$\begin{align} x^T \left( \Phi^T\Phi + \lambda I\right) x &= x^T \Phi^T\Phi x + \lambda x^T x \\ &= \| \Phi x \|_2^2 + \lambda \| x \|_2^2 \gt 0 \end{align}$$

We know that this quadratic form must be positive for any nonzero vector $x$: the first term is non-negative, and the second term $\lambda \| x \|_2^2$ is strictly positive because the length of a nonzero vector is positive. The sum of a non-negative and a positive number is positive.

Therefore, the regularized matrix $M$ satisfies the definition of positive definite. A matrix that is positive definite has all positive eigenvalues, which means that 0 is not an eigenvalue. By the invertible matrix theorem, $M$ is nonsingular.


This demonstration shows why $M$ is nonsingular for any $\lambda > 0$. In other words, the choice of $\lambda$ does not depend on $\Phi$. We know that this is the case because we've shown that all eigenvalues of $\Phi^T \Phi$ are non-negative, which is more specific than showing that all eigenvalues are real.
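
As a numerical sanity check (a minimal sketch with made-up numbers, not part of the derivation above), we can build a $\Phi$ with an exactly co-linear column and confirm that $\Phi^T\Phi$ has a zero eigenvalue while $\Phi^T\Phi + \lambda I$ has only positive eigenvalues:

```python
import numpy as np

# Minimal sketch: Phi, its duplicated column, and lambda are all illustrative choices.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))
Phi = np.column_stack([Phi, Phi[:, 0]])  # duplicate a column: exact co-linearity

M = Phi.T @ Phi                          # symmetric positive semi-definite, but singular
lam = 1e-3

print(np.linalg.eigvalsh(M).min())                    # ~0: a zero eigenvalue, so M is singular
print(np.linalg.eigvalsh(M + lam * np.eye(4)).min())  # ~lam > 0: positive definite, hence invertible
```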

Sycorax

Given the formulation:

$$ \frac{1}{2} {\left\| A x - b \right\|}_{2}^{2} + \frac{\lambda}{2} {\left\| x \right\|}_{2}^{2} $$

The optimal solution is given by $ \hat{x} = {\left( {A}^{T} A + \lambda I \right)}^{-1} {A}^{T} b $.

Since $ {A}^{T} A $ is symmetric, all of its eigenvalues are real.
It is also positive semi-definite, hence all of its eigenvalues are non-negative. Therefore, for $ \lambda > 0 $, the matrix $ {A}^{T} A + \lambda I $ has all positive eigenvalues, which means it is invertible (and numerically stable) since it is a positive definite matrix.
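
Here is a minimal numerical sketch of this closed form (the matrix $A$, vector $b$, and value of $\lambda$ below are made up): even when the columns of $A$ are co-linear, so that ${A}^{T} A$ is singular, the regularized system has full rank and can be solved directly.

```python
import numpy as np

# Illustrative ridge solution; A, b, and lam are made-up examples.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))
A[:, 3] = A[:, 1] + A[:, 2]      # exact co-linearity, so A^T A is rank-deficient
b = rng.normal(size=20)
lam = 0.1

print(np.linalg.matrix_rank(A))                          # 3 < 4: A^T A alone is singular
print(np.linalg.matrix_rank(A.T @ A + lam * np.eye(4)))  # 4: full rank once lambda > 0

# Solve the regularized normal equations (A^T A + lam I) x = A^T b
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)
print(x_hat)
```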

Eric Johnson
  • The point is that "large enough" is *any positive value whatsoever.* The information you are neglecting is that $A^\prime A$ obviously has no negative eigenvalues. – whuber Jan 10 '22 at 19:09

Without getting into linear algebra, the intuition is actually very simple. The difference between $$(\Phi'\Phi+\lambda I)^{-1}\\(\Phi'\Phi)^{-1}$$ is like the difference between $$\frac 1 {x^2+\lambda}\\\frac 1 {x^2}$$ where $x\in\mathbb R$: $$x^2\in[0,\infty)\\x^2+\lambda\in[\lambda,\infty)$$ Hence, the regularized version doesn't have a singularity at $x=0$ when $\lambda\gt 0$.
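
A tiny numeric version of the analogy (the eigenvalues below are made up): the non-negative eigenvalues of $\Phi'\Phi$ play the role of $x^2$, and adding $\lambda$ pushes them all away from zero.

```python
import numpy as np

# Made-up eigenvalues of Phi'Phi: non-negative, possibly exactly zero.
eigs = np.array([0.0, 0.5, 2.0])
lam = 0.1

with np.errstate(divide="ignore"):
    print(1 / eigs)        # [inf 2.  0.5]: the zero eigenvalue is the singularity
print(1 / (eigs + lam))    # all finite once lambda > 0
```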

Aksakal
  • You obviously have to do linear algebra, but the analogy, again, is simple: it corresponds to $\frac 1 {x^2+\lambda}$. When $\lambda>0$, there's no singularity. – Aksakal Jan 10 '22 at 21:36
  • By using suitable definitions we can avoid linear algebra altogether (or at least keep it at the high school level). For any square matrix $A$ (symmetric or not), if there is a number $\lambda\gt 0$ for which $A+\lambda I$ is singular, then, by definition, there would exist a nonzero vector $x$ for which $(A+\lambda I)x=0.$ Upon left-multiplying both sides by $x^\prime$ we would conclude $x^\prime Ax=-\lambda|x|^2\lt0.$ That is impossible, because in this context $x^\prime Ax$ represents a sum of squares, which cannot be negative. – whuber Jan 10 '22 at 21:48