2

I would like to multiply two correlated random variables, but I'm getting a negative variance. Please point out where I'm wrong.

Variable1 and Variable2 are projected onto data2 from a model trained on data1. The equation I am trying to use to calculate the variance of their products is from Variance of product of dependent variables. But when I translate it into R I get:

> data2$ProductVariance <- cov(data1$Variable2^2, data1$Variable1^2, use = "na.or.complete") + (data2$Variable2.se^2 + data2$Variable2^2)*(data2$Variable1.se^2 + data2$Variable1^2) - (cov(data1$Variable2, data1$Variable1, use = "na.or.complete") + data2$Variable2*data2$Variable1)^2
> data2[c(1, 50, 100, 150, 200), c("Variable1", "Variable1.se", "Variable2", "Variable2.se", "ProductVariance")]
    Variable1 Variable1.se  Variable2 Variable2.se ProductVariance
2   102.32145    0.4053362 0.68919721  0.006114099       -3.649645
60  103.55957    0.5298381 0.66190120  0.006577132       -3.528850
120  99.81072    0.4176735 0.07347951  0.005274879       -3.735716
183 100.59532    0.4008085 0.62787019  0.005122185       -3.777981
246 102.27556    0.3762328 0.73455316  0.006578060       -3.597734

Their covariance and squared-covariance are:

> cov(data1$Variable2, data1$Variable1, use = "na.or.complete")
[1] 0.0008326011
> cov(data1$Variable2^2, data1$Variable1^2, use = "na.or.complete")
[1] -4.00164

And, broken out by term, it looks like the negatives are coming from the 1st and 3rd term.

> data2$FirstTerm <- cov(data1$Variable2^2, data1$Variable1^2, use = "na.or.complete")
> data2$SecondTerm <- (data2$Variable2.se^2 + data2$Variable2^2)*(data2$Variable1.se^2 + data2$Variable1^2)
> data2$ThirdTerm <-  -(cov(data1$Variable2, data1$Variable1, use = "na.or.complete") + data2$Variable2*data2$Variable1)^2
> data2[c(1, 50, 100, 150, 200), c("Variable1", "Variable1.se", "Variable2", "Variable2.se", "ProductVariance", "FirstTerm", "SecondTerm", "ThirdTerm")]
    Variable1 Variable1.se  Variable2 Variable2.se ProductVariance FirstTerm SecondTerm  ThirdTerm
2   102.32145    0.4053362 0.68919721  0.006114099       -3.649645  -4.00164 4973.49137 -4973.1394
60  103.55957    0.5298381 0.66190120  0.006577132       -3.528850  -4.00164 4699.16893 -4698.6961
120  99.81072    0.4176735 0.07347951  0.005274879       -3.735716  -4.00164   54.06632   -53.8004
183 100.59532    0.4008085 0.62787019  0.005122185       -3.777981  -4.00164 3989.61583 -3989.3922
246 102.27556    0.3762328 0.73455316  0.006578060       -3.597734  -4.00164 5644.57031 -5644.1664

Have I misunderstood the formula?

EDIT

Should

> cov(data1$Variable2^2, data1$Variable1^2, use = "na.or.complete")
[1] -4.00164

be positive?

Variable1.se X ProductVariance

Variable2.se X ProductVariance

  • I think one think worth noting is that the formula you link to calls for $E(X)$ and $E(Y)$, whereas you use $x_i$ and $y_i$ for each individual $i$. That is, when you're looking for the variance of the product of two (correlated) random variables, you want to find one number to describe the whole sample - but you end up with a different number for the second and third terms for each entry precisely because you're calling the individual entries of $x_i$ and $y_i$, i.e. `data1$Variable1`, instead of $E(X)$ or $E(Y)$, i.e. `mean(data1$Variable1)`. – greggs Jan 08 '21 at 10:04

1 Answers1

2

Expanding on my earlier comment, here's the result of just switching out your uses of $x_i$ and $y_i$ for $E(X)$ and $E(Y)$:

# Set up x and y
x <- c(102.32145, 103.55957, 99.81072, 100.59532, 102.27556)
y <- c(0.68919721, 0.66190120, 0.07347951, 0.62787019, 0.73455316)
# Put them into a dataframe and take their squares
df <- data.frame(x, y)
df$x.sqr <- (df$x)^2
df$y.sqr <- (df$y)^2
# Find the variances of x and y
x.var <- var(df$x)
y.var <- var(df$y)
# Find the expected values of x and y
x.exp <- mean(df$x)
y.exp <- mean(df$y)

# 1st term, find cov(x.sqr, y.sqr)
first.term <- cov(df$x.sqr, df$y.sqr)
# first.term = 48.1506

# 2nd term, find [V(x) + E(x)^2]*[V(y) + E(y)^2]
a <- x.var + (x.exp)^2
b <- y.var + (y.exp)^2
second.term <- a*b
# second.term = 3987.9941

# 3rd term, find [cov(x,y) + E(x)E(y)]^2
c <- cov(df$x, df$y)
d <- x.exp * y.exp
third.term <- (c + d)^2
# third.term = 3248.7993

# All together now!
var.product <- first.term + second.term - third.term
# var.product = 787.3454

And the variance is positive! Obviously in this case the variance is very high - I've calculated it using just the five rows of data you display above (i.e. $n = 5$). I would imagine when you compute it for your whole sample size with a much larger value of $n$ the variance will be much more reasonable.

greggs
  • 190
  • 8