
I'm curious about the nature of $\Sigma^{-1}$. Can anybody say something intuitive about what $\Sigma^{-1}$ tells us about the data?

Edit:

Thanks for the replies.

After taking some great courses, I'd like to add some points:

  1. It is a measure of information, i.e., $x^T\Sigma^{-1}x$ is the amount of information along the direction $x$ (illustrated numerically in the sketch after this list).
  2. Duality: Since $\Sigma$ is positive definite, so is $\Sigma^{-1}$, so both induce inner-product norms; more precisely, they are dual norms of each other. We can therefore derive the Fenchel dual of the regularized least-squares problem and carry out the maximization over the dual problem, choosing whichever formulation is better conditioned.
  3. Hilbert space: The columns (and rows) of $\Sigma^{-1}$ and $\Sigma$ span the same space, so there is no advantage (other than when one of the matrices is ill-conditioned) to representing the data with $\Sigma^{-1}$ rather than $\Sigma$.
  4. Bayesian statistics: The norm of $\Sigma^{-1}$ plays an important role in Bayesian statistics: it determines how much information the prior carries. E.g., a prior whose precision satisfies $\|\Sigma^{-1}\|\rightarrow 0$ (i.e., unbounded covariance) is non-informative (in the limit, a flat or Jeffreys-type prior).
  5. Frequentist statistics: It is closely related to Fisher information via the Cramér–Rao bound. The Fisher information matrix $\mathcal{F}$ (the expected outer product of the score, i.e., of the gradient of the log-likelihood with itself) bounds the precision of any unbiased estimator: $\Sigma^{-1}\preceq \mathcal{F}$ with respect to the positive semi-definite cone (i.e., with respect to concentration ellipsoids). When $\Sigma^{-1}=\mathcal{F}$, the maximum likelihood estimator is efficient: the maximum information exists in the data, and the frequentist regime is optimal. In simpler words, for some likelihood functions (whose functional form depends purely on the probabilistic model assumed to have generated the data, a.k.a. the generative model), maximum likelihood is an efficient and consistent estimator, and rules like a boss. (Sorry for overkilling it.) See the sketch below for a numerical illustration of points 1 and 5.
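
For points 1 and 5, here is a minimal numerical sketch (assuming `numpy` and an arbitrary illustrative $2\times 2$ covariance, so the specific numbers are not from any real data): the quadratic form $x^T\Sigma^{-1}x$ is largest along the low-variance direction, and for a Gaussian with known covariance the sample mean attains the Cramér–Rao bound, i.e., $\Sigma^{-1}=\mathcal{F}$.

```python
import numpy as np

# An arbitrary 2x2 covariance, assumed purely for illustration.
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
P = np.linalg.inv(Sigma)          # the precision matrix Sigma^{-1}

# Point 1: x^T Sigma^{-1} x is large along low-variance directions
# (much "information") and small along high-variance directions.
evals, evecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
v_low, v_high = evecs[:, 0], evecs[:, 1]  # low- and high-variance axes
print(v_low @ P @ v_low)    # = 1/evals[0], the larger value
print(v_high @ P @ v_high)  # = 1/evals[1], the smaller value

# Point 5: for a Gaussian with known covariance, the Fisher information
# for the mean is exactly Sigma^{-1}, and the sample mean attains the
# Cramer-Rao bound: n * Cov(sample mean) = Sigma = F^{-1}.
rng = np.random.default_rng(0)
n, reps = 500, 2000
means = np.array([rng.multivariate_normal([0.0, 0.0], Sigma, n).mean(axis=0)
                  for _ in range(reps)])
print(n * np.cov(means.T))  # approximately Sigma, so the bound is attained
```
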
Arya
  • I think PCA picks up eigenvectors with large eigenvalues rather than small eigenvalues. – wdg Dec 03 '14 at 13:43
  • (3) Is incorrect, because it is tantamount to asserting the columns of $\Sigma^{-1}$ are those of $\Sigma$ (up to a permutation), which is true only for the identity matrix. – whuber May 15 '15 at 17:37

2 Answers


It is a measure of precision just as $\Sigma$ is a measure of dispersion.

More elaborately, $\Sigma$ is a measure of how the variables are dispersed around the mean (the diagonal elements) and how they co-vary with other variables (the off-diagonal elements). The greater the dispersion, the farther they are from the mean; and the more they co-vary (in absolute value) with the other variables, the stronger their tendency to 'move together' (in the same or opposite direction, depending on the sign of the covariance).

Similarly, $\Sigma^{-1}$ is a measure of how tightly clustered the variables are around the mean (the diagonal elements) and the extent to which they do not co-vary with the other variables (the off-diagonal elements). Thus, the higher the diagonal element, the tighter the variable is clustered around the mean. The interpretation of the off-diagonal elements is more subtle and I refer you to the other answers for that interpretation.
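
To make the "tightly clustered" reading precise for a Gaussian: $1/(\Sigma^{-1})_{ii}$ equals the *conditional* variance of variable $i$ given all the other variables, so a large diagonal entry of $\Sigma^{-1}$ means variable $i$ is tightly clustered around its conditional mean. A minimal sketch (assuming `numpy`; the covariance is an arbitrary illustrative one):

```python
import numpy as np

rng = np.random.default_rng(1)
# An arbitrary positive-definite 3x3 covariance, assumed for illustration.
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)
P = np.linalg.inv(Sigma)

# For a Gaussian, 1/P[i, i] equals Var(X_i | all other variables):
# the bigger the diagonal entry of the precision matrix, the tighter
# X_i is clustered once the other variables are accounted for.
i, others = 0, [1, 2]
S_oo = Sigma[np.ix_(others, others)]   # covariance of the other variables
s_oi = Sigma[others, i]                # their covariance with X_i
cond_var = Sigma[i, i] - s_oi @ np.linalg.solve(S_oo, s_oi)
print(cond_var, 1.0 / P[i, i])         # the two numbers agree
```
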

prop
  • A strong counter-example to your last statement about off-diagonal elements in $\Sigma^{-1}$ is afforded by the simplest nontrivial example in two dimensions, $\Sigma^{-1}=\left( \begin{array}{cc} \frac{1}{1-\rho ^2} & -\frac{\rho }{1-\rho ^2} \\ -\frac{\rho }{1-\rho ^2} & \frac{1}{1-\rho ^2} \\ \end{array} \right).$ The larger off-diagonal values correspond to *more* extreme values of the correlation coefficient $\rho,$ which is the opposite of what you appear to be saying. (See the numerical check after these comments.) – whuber Oct 22 '13 at 14:27
  • @whuber Right. I should get rid of the 'absolute' word in the last sentence. Thanks – prop Oct 22 '13 at 14:39
  • Thanks, but that still doesn't cure the problem: the relationship you assert between the off-diagonal elements of the inverse and the co-variation does not exist. – whuber Oct 22 '13 at 16:43
  • @whuber I think it does. In your example, the off-diagonal elements are negative. Therefore, as $\rho$ increases the off-diagonal elements decrease. You can check this by noting the following: at $\rho = 0$ the off-diagonal element is $0$; as $\rho$ approaches $1$ the off-diagonal elements approach $-\infty$ and the derivative of the off-diagonal element with respect to $\rho$ is negative. – prop Oct 22 '13 at 17:03
  • My off-diagonal elements are positive when $\rho\lt 0.$ – whuber Oct 22 '13 at 19:16
  • I don't see the contradiction regarding the off-diagonal elements when $\rho<0$. As $\rho$ approaches $-1$, X and Y become more anti-correlated, and the precision value approaches $\infty$, meaning they are becoming more NOT co-varying, am I right? – Jason May 23 '18 at 06:13
  • OK, so this might be a silly question: are dispersion (how far a population is from its mean) and covariance (how one population increases as another increases) two "conceptually" different things encoded by the "same formula"? In other words, in my mind "dispersion" is completely different from "covariance", so are the two different or the same? – GENIVI-LEARNER Mar 31 '20 at 18:27
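
For concreteness, here is a quick numerical check of whuber's two-variable example from the comments above (a minimal sketch assuming `numpy`): the off-diagonal entry of $\Sigma^{-1}$ is $-\rho/(1-\rho^2)$, whose magnitude grows with $|\rho|$ and whose sign is opposite to $\rho$'s.

```python
import numpy as np

# whuber's example: Sigma = [[1, rho], [rho, 1]] for several rho.
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    P = np.linalg.inv(Sigma)
    # Off-diagonal of the inverse is -rho/(1 - rho^2): its magnitude
    # grows with |rho|, and its sign is opposite to rho's.
    print(rho, P[0, 1], -rho / (1 - rho**2))
```
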

Using superscripts to denote the elements of the inverse, $1/\sigma^{ii}$ is the variance of the component of variable $i$ that is uncorrelated with the $p-1$ other variables, and $-\sigma^{ij}/\sqrt{\sigma^{ii}\sigma^{jj}}$ is the partial correlation of variables $i$ and $j$, controlling for the $p-2$ other variables.
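
A quick numerical check of the partial-correlation identity (a minimal sketch assuming `numpy`, with an arbitrary illustrative covariance; the diagonal identity can be verified the same way, as in the sketch under the previous answer):

```python
import numpy as np

rng = np.random.default_rng(2)
# An arbitrary positive-definite covariance for p = 3 variables (illustrative).
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)
P = np.linalg.inv(Sigma)   # its entries are the sigma^{ij} of this answer

# Partial correlation of X_0 and X_1 given X_2, computed directly from the
# conditional (Gaussian) covariance...
i, j, k = 0, 1, 2
num = Sigma[i, j] - Sigma[i, k] * Sigma[j, k] / Sigma[k, k]
den = np.sqrt((Sigma[i, i] - Sigma[i, k] ** 2 / Sigma[k, k]) *
              (Sigma[j, j] - Sigma[j, k] ** 2 / Sigma[k, k]))
print(num / den)

# ...matches -sigma^{ij} / sqrt(sigma^{ii} sigma^{jj}) read off the inverse.
print(-P[i, j] / np.sqrt(P[i, i] * P[j, j]))
```
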

Ray Koopman