
I've been reviewing Gaussian Processes and, from what I can tell, there's some debate about whether the inversion of the "covariance matrix" (returned by the kernel) should be carried out through explicit matrix inversion (expensive and numerically unstable) or via Cholesky decomposition.

I'm quite new to Cholesky decomposition, and I've come to understand that it's akin to taking the square root of a scalar. Likewise, the inverse of a matrix is akin to division by a scalar (e.g., multiplying $A A^{-1} = I$ returns the identity matrix, which resembles $5/5 = 1$).

I'm struggling to make the connection: what's the relationship between the Cholesky decomposition of a covariance matrix and the inverse of that covariance matrix? Are additional steps required to cement the equivalence of the solutions?

jbuddy_13
  • Are you asking about how to work through the algebra to show equivalence between using an inverse directly and using a Cholesky factor, or are you asking something else? – Sycorax Dec 31 '20 at 20:16
  • @Sycorax – I like how you phrased it: "How to work through the algebra to show equivalence between using an inverse directly and using a Cholesky factor" as my write-up lacked this eloquence/simplicity. Yes; I'd like to see how this relationship works out. – jbuddy_13 Dec 31 '20 at 20:19
  • In any numerical algorithm, a statement that some matrix "needs" to be inverted is almost always *wrong*. (I.e. it is wrong at least 99.99% of the time). See Sycorax's answer for why it is wrong here. And note, there are often better numerical methods that don't even involve explicitly calculating the covariance matrix at all, let alone factoring it. – alephzero Jan 01 '21 at 19:34
  • @alephzero, perhaps my wording was poor. To arrive at the desired solution, matrix inversion must either be accomplished by standard methods or sidestepped through clever alternatives as noted in the accepted answer (which uses Cholesky decomposition). – jbuddy_13 Jan 01 '21 at 20:33
  • *there's some debate whether ... should be done so through matrix inversion (expensive and numerically unstable) or via Cholesky decomposition.* - Is there really a "debate"? I suspect the issue is exposition - across statistics generally, when explaining a theory, textbooks present formulas including inverses; look at what's done in practice, and matrices are near-as-never inverted, as @alephzero says. Since more people see the Q than the A (eg via google snippets) maybe the Q should be edited so as not to suggest there's any "controversy" about avoiding explicit computation of an inverse? – Silverfish Jan 02 '21 at 00:49
  • @Silverfish The way I see it, part of the issue is `inverse(A)` returning an array full of numbers in many languages, rather than a proxy object that applies the correct algorithms, like Matlab's [`decomposition`](https://www.mathworks.com/help/matlab/ref/decomposition.html) or Julia's [`factorize`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.factorize). Then one would be able to write `inverse(A)*b` straight from the textbook without any performance/accuracy penalty. – Federico Poloni Jan 02 '21 at 13:41

1 Answer


Gaussian process models often involve computing some quadratic form, such as $$ y = x^\top\Sigma^{-1}x $$ where $\Sigma$ is positive definite, $x$ is a vector of appropriate dimension, and we wish to compute the scalar $y$. Typically, you don't want to compute $\Sigma^{-1}$ directly, because of cost or loss of precision. By the definition of the Cholesky factor $L$, we know $\Sigma=LL^\top$. Because $\Sigma$ is PD, the diagonal entries of $L$ are positive, which implies that $L$ is non-singular. In this exposition, $L$ is lower-triangular.
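
To make the factorization concrete, here's a minimal NumPy sketch using a small made-up positive-definite matrix: the factor returned by `np.linalg.cholesky` is lower-triangular with a strictly positive diagonal, and multiplying it by its transpose reconstructs $\Sigma$.

```python
import numpy as np

# A small, made-up positive-definite matrix standing in for a kernel's covariance.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

# np.linalg.cholesky returns the lower-triangular factor L with Sigma = L L^T.
L = np.linalg.cholesky(Sigma)

print(np.diag(L))                   # strictly positive diagonal entries
print(np.allclose(L @ L.T, Sigma))  # True: the factorization reconstructs Sigma
```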

We can rewrite $$\begin{align} y &= x^\top(LL^\top)^{-1}x \\ &= x^\top L^{-\top}L^{-1}x \\ &= (L^{-1}x)^\top L^{-1}x \\ &= z^\top z \end{align} $$

The first to second line uses an elementary property of the matrix inverse: $(LL^\top)^{-1} = L^{-\top}L^{-1}$. The second to third line just rearranges the transpose. The final line rewrites the quadratic form as a vector dot product, which is convenient because we only need to compute $z = L^{-1}x$ once.

The nice thing about triangular matrices is that they're dirt-simple to solve with, so we never actually compute $L^{-1}x$; instead, we obtain $z$ by forward substitution on $Lz=x$, which is very cheap compared to computing an inverse directly.
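
Putting the pieces together, here's a minimal NumPy/SciPy sketch of the whole recipe, using a randomly generated positive-definite $\Sigma$ as a stand-in for a kernel matrix:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)

# Random positive-definite Sigma (a stand-in for a kernel matrix) and a vector x.
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)
x = rng.standard_normal(5)

# Naive route: explicitly invert Sigma (the thing we want to avoid).
y_inv = x @ np.linalg.inv(Sigma) @ x

# Cholesky route: forward substitution for L z = x, then a dot product.
L = np.linalg.cholesky(Sigma)           # Sigma = L L^T, L lower-triangular
z = solve_triangular(L, x, lower=True)  # solves L z = x without forming L^{-1}
y_chol = z @ z                          # y = z^T z

print(np.allclose(y_inv, y_chol))       # True: both routes agree
```

For repeated quadratic forms with the same $\Sigma$, the factorization is computed once, and each new $x$ costs only one cheap triangular solve; that is where the savings over forming $\Sigma^{-1}$ really show up.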

Sycorax
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/117975/discussion-on-answer-by-sycorax-relationship-between-cholesky-decomposition-and). – Sycorax Jan 03 '21 at 11:47