Generalized Variance in high dimension setting (p>n)

Question

Suppose I have $n$ data vectors $X_1, \dots, X_n$, where each $X_i$ is a random vector of length $p$.

The sample covariance matrix is $S=\frac{1}{n-1}\sum_i (X_i-\bar{X})(X_i-\bar{X})^T$. If I want to compute the generalized variance, I can compute the determinant of $S$, i.e. $|S|$.

In my actual use case where $p$ is fixed, but $p>n$, is $|S|$ the best estimator that I can use?

In my literature search I came across two types of papers. One type addresses the case $p<n$ and discusses several kinds of better estimators. The other addresses the case where $p$ can grow with $n$, but $p(n)<n$, and gives asymptotic results. This latter case confuses me a bit: since $p$ is fixed in my case, $n$ will eventually exceed $p$ in the asymptotics, but in practice I will never be in that regime.

Edit

My purpose is to use the generalized variance as a summary statistic. I know that using a one-number summary to describe the variability of multivariate data (functional data, actually, in my case) is not a brilliant idea, but that is what is needed for practical purposes. $\hat{\Sigma}$ (whatever estimator we use) has $p\times p$ entries (or $p(p+1)/2$ distinct ones, by symmetry), which is too many. A one-number summary like $tr(\hat{\Sigma})$ or $|\hat{\Sigma}|$ is better.

The answer would depend on your goal. You could use any of the eigenvalues of the covariance matrix, or a combination of them. (or equivalently, use [singular values](https://en.wikipedia.org/wiki/Singular_value_decomposition) of the centered/scaled data matrix). The determinant is the product of these (but really you would use their geometric average, to be a "variance", i.e. take n-th root). — GeoMatt22, Apr 12 '17 at 05:44
@GeoMatt22 Some of the singular values are numerically equivalent to zero. What are you supposed to do in this case? — Matthew Gunn, Apr 12 '17 at 07:24
@MatthewGunn I was implicitly considering the reduced SVD, so essentially as in your answer (i.e. "structural" zeros eliminated, but "full rank" = $n$, `svd( ,'econ')` in Matlab; if there are additional *numerical* zeros, these should probably also be eliminated?) — GeoMatt22, Apr 12 '17 at 13:26
Thanks GeoMatt22. In practice I will be using software to compute all these. So for example in `r`, `determinant()` uses LU decomposition (if I have not misunderstood). But my question is not really on how to compute $|S|$, but whether $|S|$ is the best *estimator* to use. For example, some of the papers I cited use some sort of shrinkage estimator $aS+b\Phi$, or some bias corrected form like $S+\tau$. I am not saying these are better (or not), my question is whether $|S|$ is a good estimator. — qoheleth, Apr 12 '17 at 23:49

score 3 · Answer 1 · edited Apr 13 '17 at 12:44

The problem with $p \geq n$

If $p \geq n$ then as you recognize your covariance matrix is necessarily rank deficient.
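To spell out why (a short sketch in the question's notation; $\tilde{X}$ is notation introduced here, not from the question): let $\tilde{X}$ be the $n \times p$ matrix whose $i$-th row is $(X_i-\bar{X})^T$. Then

$$S = \frac{1}{n-1}\tilde{X}^T\tilde{X}, \qquad \operatorname{rank}(S) = \operatorname{rank}(\tilde{X}) \leq \min(p,\, n-1),$$

because the rows of $\tilde{X}$ sum to zero. So whenever $p \geq n$, $S$ has at least $p-n+1$ zero eigenvalues and $|S| = 0$ regardless of the data.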

Consider the $p=2$ case. If $n=3$, you can find the ellipse defined by the estimated covariance matrix for the three points.

But if $n=2$, then what is the ellipse supposed to be?! The rank-deficient covariance matrix defines a degenerate ellipse: the line segment between the two points. For $p \geq n$, the usual sample covariance has some direction with zero variation (in this case, basically the $(1,-1)$ vector). Does that mean there's actually zero variation along that direction, and that your data only occupies a one-dimensional space? (Or if we drew a third blue point, could it fall within one of the larger ellipses?)
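A minimal numerical sketch of the $n=p=2$ case, in plain Python. The two points are hypothetical (chosen here so that the zero-variance direction is $(1,-1)$, not taken from the original figure), and the helper names `sample_covariance` and `variance_along` are made up for this illustration.

```python
def sample_covariance(rows):
    """Unbiased sample covariance S of a list of equal-length observations."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def variance_along(S, v):
    """Sample variance of the data projected onto direction v: v'Sv / v'v."""
    p = len(v)
    quad = sum(v[a] * S[a][b] * v[b] for a in range(p) for b in range(p))
    return quad / sum(x * x for x in v)


# Hypothetical data: n = 2 points in p = 2 dimensions.
points = [[1.0, 1.0], [3.0, 3.0]]
S = sample_covariance(points)

# 2x2 determinant, i.e. the generalized variance |S|.
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]

print(S)                            # sample covariance matrix
print(det_S)                        # generalized variance
print(variance_along(S, [1, -1]))   # direction orthogonal to the segment
print(variance_along(S, [1, 1]))    # direction along the segment
```

All of the sample variation sits along the segment joining the two points, so $|S|$ is zero here no matter how spread out the underlying distribution really is.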

If you're a Bayesian, what's your prior on the covariance matrix? One possible direction is to add some kind of prior or regularization. (I'm admittedly not an expert in this area. I'd love to see other answers.)

I think we'd need to know more about your problem, and why you want the determinant, to have an idea of how to proceed. Perhaps you want something like the pseudo-determinant?
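For reference, a sketch of what the pseudo-determinant would be here (the symbols $\lambda_i$, $\sigma_i$, $r$ and $\tilde{X}$ are notation introduced for this illustration, not from the question). Let $\lambda_1 \geq \dots \geq \lambda_r > 0$ be the nonzero eigenvalues of $S$, and let $\sigma_1, \dots, \sigma_r$ be the nonzero singular values of the centered $n \times p$ data matrix $\tilde{X}$. Then

$$\operatorname{pdet}(S) = \prod_{i=1}^{r} \lambda_i = \prod_{i=1}^{r} \frac{\sigma_i^2}{n-1}, \qquad r = \operatorname{rank}(S) \leq n-1 .$$

Taking the $r$-th root, $\operatorname{pdet}(S)^{1/r}$, gives the geometric mean of the nonzero eigenvalues, which is on the scale of a variance. Whether that is a good estimator of anything about the population covariance when $p > n$ is a separate question that this formula does not settle.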
