I have two samples that partially overlap on the variables they describe. The samples are taken from more or less the same population, and show similar values on the overlapping variables.

Based on this, I can pool the descriptive statistics into one large covariance matrix.

To be more specific: From Sample 1 the following statistics were reported:
$\Sigma_1$: $\{A, B, C\} \times \{A, B, C\}$ (the variance-covariance matrix of the variables $A$, $B$, $C$)
$\Sigma_2$: $\{A, B, C\} \times \{D, E\}$ (the cross-covariance matrix between $A$, $B$, $C$ and $D$, $E$)
From Sample 2 I have:
$\Sigma_3$: $\{D, E\} \times \{D, E\}$ (the variance-covariance matrix of $D$, $E$)

The total covariance matrix then becomes:

$$\pmatrix{\Sigma_1 & \Sigma_2 \\ \Sigma_2^T & \Sigma_3}.$$

Do I have any guarantee that this will be a valid covariance matrix? Obviously I can check by diagonalising and asserting that all eigenvalues are positive. Is this always the case?

Another way to pose the question would be: if I have a positive definite matrix and I replace its lower right (or upper left) square block with another positive definite matrix, one whose entries each differ from the original entries by only a small amount $\delta$, how does this affect the eigenvalues of the matrix?
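
To make the check concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the numbers are invented, and the helper names `assemble` and `is_positive_definite` are hypothetical. Instead of diagonalising, it attempts a Cholesky factorisation, which succeeds only when a symmetric matrix is positive definite (so a singular but still positive semidefinite matrix would be rejected as well).

```python
import math


def assemble(sigma1, sigma2, sigma3):
    """Stack the blocks as [[Sigma1, Sigma2], [Sigma2^T, Sigma3]]."""
    sigma2_t = [list(col) for col in zip(*sigma2)]
    top = [list(r1) + list(r2) for r1, r2 in zip(sigma1, sigma2)]
    bottom = [r2t + list(r3) for r2t, r3 in zip(sigma2_t, sigma3)]
    return top + bottom


def is_positive_definite(m):
    """Attempt a Cholesky factorisation; return False as soon as a pivot is <= 0."""
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


# Hypothetical statistics: unit variances for A, B, C; A is strongly tied to D, B to E.
sigma1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma2 = [[0.95, 0.0], [0.0, 0.95], [0.0, 0.0]]

# Variances of D, E as they might come from the same sample as sigma1 and sigma2 ...
sigma3_same_sample = [[1.0, 0.0], [0.0, 1.0]]
# ... and slightly different values, as if taken from a second sample.
sigma3_other_sample = [[0.9, 0.0], [0.0, 0.9]]

consistent = assemble(sigma1, sigma2, sigma3_same_sample)
pooled = assemble(sigma1, sigma2, sigma3_other_sample)

print(is_positive_definite(consistent))
print(is_positive_definite(pooled))
```

In this made-up case a change of 0.1 in two diagonal entries is enough to make the check fail, because the implied variance of $D$ given $A$ would be $0.9 - 0.95^2 < 0$.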

Ivana

1 Answer

No, there is no guarantee. (This is a common problem.)

Consider this tiny dataset of three variables $A$, $B$, and $C$:

 A  B  C
 1  1  1
 0  0 NA
 0 NA  2
NA  0  0

There is partial overlap, as the first record demonstrates. The sample covariance matrix (with each entry computed from all observations that are complete for the corresponding pair of variables) is

$$\frac{1}{6}\pmatrix{2 & 3 & -3 \\ 3 & 2 & 3 \\ -3 & 3 & 6}.$$

Its eigenvalues are $4/3, 5/6, -1/2$. Because one of them is negative, this matrix is not positive semidefinite, so it is not a valid covariance matrix.
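
These numbers can be reproduced exactly. The following is a minimal sketch in plain Python using only the standard library; the helper names `pairwise_cov`, `det3`, and `char_poly` are made up for this illustration. It computes each covariance from the rows in which both variables are observed (with the usual $n-1$ denominator) and then confirms, in exact rational arithmetic, that the three stated values are roots of the characteristic polynomial.

```python
from fractions import Fraction

NA = None  # marker for a missing observation

# The dataset from the answer: columns A, B, C.
rows = [
    (1, 1, 1),
    (0, 0, NA),
    (0, NA, 2),
    (NA, 0, 0),
]


def pairwise_cov(data, i, j):
    """Sample covariance of columns i and j over rows where both are observed."""
    pairs = [(Fraction(r[i]), Fraction(r[j])) for r in data
             if r[i] is not None and r[j] is not None]
    n = len(pairs)
    mean_i = sum(x for x, _ in pairs) / n
    mean_j = sum(y for _, y in pairs) / n
    return sum((x - mean_i) * (y - mean_j) for x, y in pairs) / (n - 1)


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, k) = m
    return a * (e * k - f * h) - b * (d * k - f * g) + c * (d * h - e * g)


def char_poly(m, lam):
    """Evaluate det(m - lam * I) for a 3x3 matrix."""
    shifted = [[m[i][j] - (lam if i == j else 0) for j in range(3)]
               for i in range(3)]
    return det3(shifted)


S = [[pairwise_cov(rows, i, j) for j in range(3)] for i in range(3)]

# Entries of 6 * S, for comparison with the matrix displayed above.
print([[int(6 * x) for x in row] for row in S])
# → [[2, 3, -3], [3, 2, 3], [-3, 3, 6]]

eigenvalues = [Fraction(4, 3), Fraction(5, 6), Fraction(-1, 2)]
for lam in eigenvalues:
    # Each stated eigenvalue must make det(S - lam * I) vanish exactly.
    assert char_poly(S, lam) == 0
```

A quicker sanity check is the determinant alone: `det3(S)` is negative, which is already impossible for a positive semidefinite matrix.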

whuber
  • I suppose there exists an example without `NA` values, or are they necessary? I think including `NA`s brings in some unnecessary confusion. – Richard Hardy Apr 16 '15 at 17:20
  • @Richard I don't follow: the entire question concerns the situation where there *are* missing values. If no values are missing, then necessarily the sample covariance matrix is positive-semidefinite. – whuber Apr 16 '15 at 17:22
  • Sorry, I did not read the question carefully enough. You must be right. – Richard Hardy Apr 16 '15 at 18:07
  • I'm sorry, I did not mean missing values. My question was very imprecise and is fixed now. – Ivana Apr 17 '15 at 10:35
  • Actually, you *did* mean missing values: my interpretation of your question is consistent with how you rephrased it. Simply concatenate all three samples into one database and place NA values wherever observations are not available. – whuber Apr 17 '15 at 15:36
  • @whuber Ah, I see what you mean. – Ivana Apr 19 '15 at 21:10